There are many books that are excellent sources of knowledge about
individual statistical tools (survival models, general linear models, etc.), but
the art of data analysis is about choosing and using multiple tools. In the
words of Chatfield [100, p. 420], “. . . students typically know the technical
details of regression for example, but not necessarily when and how to apply it.
This argues the need for a better balance in the literature and in statistical
teaching between techniques and problem solving strategies.” Whether
analyzing risk factors, adjusting for biases in observational studies, or developing
predictive models, there are common problems that few regression texts
address. For example, there are missing data in the majority of datasets one is
likely to encounter (other than those used in textbooks!) but most regression
texts do not include methods for dealing with such data effectively, and most
texts on missing data do not cover regression modeling.

This  book  links  standard  regression  modeling  approaches  with 

•  methods  for  relaxing  linearity  assumptions  that  still  allow  one  to  easily 
obtain  predictions  and  confidence  limits  for  future  observations,  and  to  do 
formal  hypothesis  tests, 

• non-additive modeling approaches not requiring the assumption that
interactions are always linear × linear,

• methods for imputing missing data and for penalizing variances for
incomplete data,

•  methods  for  handling  large  numbers  of  predictors  without  resorting  to 
problematic  stepwise  variable  selection  techniques, 

•  data  reduction  methods  (unsupervised  learning  methods,  some  of  which 
are  based  on  multivariate  psychometric  techniques  too  seldom  used  in 
statistics)  that  help  with  the  problem  of  “too  many  variables  to  analyze  and 
not  enough  observations”  as  well  as  making  the  model  more  interpretable 
when  there  are  predictor  variables  containing  overlapping  information, 

• methods for quantifying predictive accuracy of a fitted model,

• powerful model validation techniques based on the bootstrap that allow the
analyst to estimate predictive accuracy nearly unbiasedly without holding
back data from the model development process, and

•  graphical  methods  for  understanding  complex  models. 

On  the  last  point,  this  text  has  special  emphasis  on  what  could  be  called 
“presentation  graphics  for  fitted  models”  to  help  make  regression  analyses 
more  palatable  to  non-statisticians.  For  example,  nomograms  have  long  been 
used  to  make  equations  portable,  but  they  are  not  drawn  routinely  because 
doing  so  is  very  labor-intensive.  An  R  function  called  nomogram  in  the  package 
described  below  draws  nomograms  from  a  regression  fit,  and  these  diagrams 
can  be  used  to  communicate  modeling  results  as  well  as  to  obtain  predicted 
values  manually  even  in  the  presence  of  complex  variable  transformations. 

Most  of  the  methods  in  this  text  apply  to  all  regression  models,  but  special 
emphasis  is  given  to  some  of  the  most  popular  ones:  multiple  regression  using 
least  squares  and  its  generalized  least  squares  extension  for  serial  (repeated 
measurement)  data,  the  binary  logistic  model,  models  for  ordinal  responses, 
parametric  survival  regression  models,  and  the  Cox  semiparametric  survival 
model. There is also a chapter on nonparametric transform-both-sides
regression. Emphasis is given to detailed case studies for these methods as well as
for data reduction, imputation, model simplification, and other tasks.
Except for the case study on survival of Titanic passengers, all examples are
from biomedical research. However, the methods presented here have broad
application to other areas including economics, epidemiology, sociology,
psychology, engineering, and predicting consumer behavior and other business
outcomes.

This  text  is  intended  for  Masters  or  PhD  level  graduate  students  who 
have  had  a  general  introductory  probability  and  statistics  course  and  who 
are  well  versed  in  ordinary  multiple  regression  and  intermediate  algebra.  The 
book  is  also  intended  to  serve  as  a  reference  for  data  analysts  and  statistical 
methodologists.  Readers  without  a  strong  background  in  applied  statistics 
may  wish  to  first  study  one  of  the  many  introductory  applied  statistics  and 
regression texts that are available. The author’s course notes Biostatistics
for Biomedical Research on the text’s web site cover basic regression and
many other topics. The paper by Nick and Hardin [476] also provides a good
introduction  to  multivariable  modeling  and  interpretation.  There  are  many 
excellent  intermediate  level  texts  on  regression  analysis.  One  of  them  is  by 
Fox,  which  also  has  a  companion  software-based  text  [200,201].  For  readers 
interested  in  medical  or  epidemiologic  research,  Steyerberg’s  excellent  text 
Clinical  Prediction  Models  [586]  is  an  ideal  companion  for  Regression  Modeling 
Strategies.  Steyerberg’s  book  provides  further  explanations,  examples,  and 
simulations  of  many  of  the  methods  presented  here.  And  no  text  on  regression 
modeling  should  fail  to  mention  the  seminal  work  of  John  Nelder  [450]. 

The overall philosophy of this book is summarized by the following
statements.

• Satisfaction of model assumptions improves precision and increases
statistical power.

•  It  is  more  productive  to  make  a  model  fit  step  by  step  (e.g.,  transformation 
estimation)  than  to  postulate  a  simple  model  and  find  out  what  went 
wrong. 

•  Graphical  methods  should  be  married  to  formal  inference. 

•  Overfitting  occurs  frequently,  so  data  reduction  and  model  validation  are 
important. 

•  In  most  research  projects,  the  cost  of  data  collection  far  outweighs  the  cost 
of  data  analysis,  so  it  is  important  to  use  the  most  efficient  and  accurate 
modeling  techniques,  to  avoid  categorizing  continuous  variables,  and  to 
not  remove  data  from  the  estimation  sample  just  to  be  able  to  validate  the 
model. 

• The bootstrap is a breakthrough for statistical modeling, and the analyst
should use it for many steps of the modeling strategy, including
derivation of distribution-free confidence intervals and estimation of optimism
in model fit that takes into account variations caused by the modeling
strategy.

• Imputation of missing data is better than discarding incomplete
observations.

• Variance often dominates bias, so biased methods such as penalized
maximum likelihood estimation yield models that have a greater chance of
accurately predicting future observations.

•  Software  without  multiple  facilities  for  assessing  and  fixing  model  fit  may 
only  seem  to  be  user-friendly. 

• Carefully fitting an improper model is better than badly fitting (and
overfitting) a well-chosen one.

•  Methods  that  work  for  all  types  of  regression  models  are  the  most  valuable. 

•  Using  the  data  to  guide  the  data  analysis  is  almost  as  dangerous  as  not 
doing  so. 

•  There  are  benefits  to  modeling  by  deciding  how  many  degrees  of  freedom 
(i.e.,  number  of  regression  parameters)  can  be  “spent,”  deciding  where  they 
should  be  spent,  and  then  spending  them. 

On the last point, the author believes that significance tests and P-values
are problematic, especially when making modeling decisions. Judging by the
increased  emphasis  on  confidence  intervals  in  scientific  journals  there  is  reason 
to  believe  that  hypothesis  testing  is  gradually  being  de-emphasized.  Yet  the 
reader  will  notice  that  this  text  contains  many  P-values.  How  does  that  make 
sense  when,  for  example,  the  text  recommends  against  simplifying  a  model 
when  a  test  of  linearity  is  not  significant?  First,  some  readers  may  wish  to 
emphasize  hypothesis  testing  in  general,  and  some  hypotheses  have  special 
interest,  such  as  in  pharmacology  where  one  may  be  interested  in  whether 
the  effect  of  a  drug  is  linear  in  log  dose.  Second,  many  of  the  more  interesting 
hypothesis  tests  in  the  text  are  tests  of  complexity  (nonlinearity,  interaction) 
of the overall model. Null hypotheses of linearity of effects in particular are
frequently rejected, providing formal evidence that the analyst’s investment
of time to use more than simple statistical models was warranted.

The  rapid  development  of  Bayesian  modeling  methods  and  rise  in  their  use 
is exciting. Full Bayesian modeling greatly reduces the need for the
approximations made for confidence intervals and distributions of test statistics, and
Bayesian methods formalize the still rather ad hoc frequentist approach to
penalized maximum likelihood estimation by using skeptical prior
distributions to obtain well-defined posterior distributions that automatically deal
with  shrinkage.  The  Bayesian  approach  also  provides  a  formal  mechanism  for 
incorporating  information  external  to  the  data.  Although  Bayesian  methods 
are  beyond  the  scope  of  this  text,  the  text  is  Bayesian  in  spirit  by  emphasizing 
the  careful  use  of  subject  matter  expertise  while  building  statistical  models. 

The  text  emphasizes  predictive  modeling,  but  as  discussed  in  Chapter  1, 
developing  good  predictions  goes  hand  in  hand  with  accurate  estimation  of 
effects and with hypothesis testing (when appropriate). Besides emphasis
on multivariable modeling, the text includes Chapter 17, which introduces
survival analysis and methods for analyzing various types of single and multiple
events. This book does not provide examples of analyses of one common
type of response variable, namely, cost and related measures of resource
consumption. However, least squares modeling presented in Section 15.1, the
robust rank-based methods presented in Chapters 13, 15, and 20, and the
transform-both-sides regression models discussed in Chapter 16 are very
applicable and robust for modeling economic outcomes. See [167] and [260] for
example  analyses  of  such  dependent  variables  using,  respectively,  the  Cox 
model  and  nonparametric  additive  regression.  The  central  Web  site  for  this 
book  (see  the  Appendix)  has  much  more  material  on  the  use  of  the  Cox  model 
for  analyzing  costs. 

This  text  does  not  address  some  important  study  design  issues  that  if  not 
respected  can  doom  a  predictive  modeling  or  estimation  project  to  failure. 
See  Laupacis,  Sekar,  and  Stiell  [378]  for  a  list  of  some  of  these  issues. 

Heavy  use  is  made  of  the  S  language  used  by  R.  R  is  the  focus  because 
it  is  an  elegant  object-oriented  system  in  which  it  is  easy  to  implement  new 
statistical  ideas.  Many  R  users  around  the  world  have  done  so,  and  their  work 
has  benefited  many  of  the  procedures  described  here.  R  also  has  a  uniform 
syntax  for  specifying  statistical  models  (with  respect  to  categorical  predictors, 
interactions,  etc.),  no  matter  which  type  of  model  is  being  fitted  [96]. 

The  free,  open-source  statistical  software  system  R  has  been  adopted  by 
analysts  and  research  statisticians  worldwide.  Its  capabilities  are  growing 
exponentially  because  of  the  involvement  of  an  ever-growing  community  of 
statisticians who are adding new tools to the base R system through
contributed packages. All of the functions used in this text are available in R.
See  the  book’s  Web  site  for  updated  information  about  software  availability. 

Readers  who  don’t  use  R  or  any  other  statistical  software  environment  will 
still  find  the  statistical  methods  and  case  studies  in  this  text  useful,  and  it  is 
hoped that the code that is presented will make the statistical methods more
concrete. At the very least, the code demonstrates that all of the methods
presented  in  the  text  are  feasible. 

This  text  does  not  teach  analysts  how  to  use  R.  For  that,  the  reader  may 
wish to see reading recommendations on www.r-project.org as well as Venables
and  Ripley  [635]  (which  is  also  an  excellent  companion  to  this  text)  and  the 
many  other  excellent  texts  on  R.  See  the  Appendix  for  more  information. 

In addition to powerful features that are built into R, this text uses a
package of freely available R functions called rms written by the author. rms
tracks modeling details related to the expanded X or design matrix. It is a
series of over 200 functions for model fitting, testing, estimation, validation,
graphics, prediction, and typesetting by storing enhanced model design
attributes in the fit. rms includes functions for least squares and penalized least
squares multiple regression modeling in addition to functions for binary and
ordinal regression, generalized least squares for analyzing serial data,
quantile regression, and survival analysis that are emphasized in this text. Other
freely  available  miscellaneous  R  functions  used  in  the  text  are  found  in  the 
Hmisc  package  also  written  by  the  author.  Functions  in  Hmisc  include  facilities 
for  data  reduction,  imputation,  power  and  sample  size  calculation,  advanced 
table  making,  recoding  variables,  importing  and  inspecting  data,  and  general 
graphics.  Consult  the  Appendix  for  information  on  obtaining  Hmisc  and  rms. 

The author and his colleagues have written SAS macros for fitting
restricted cubic splines and for other basic operations. See the Appendix for
more information. It is unfair not to mention some excellent capabilities of
other statistical packages such as Stata (which has also been extended to
provide regression splines and other modeling tools), but the extendability
and graphics of R make it especially attractive for all aspects of the
comprehensive modeling strategy presented in this book.

Portions  of  Chapters  4  and  20  were  published  as  reference  [269].  Some  of 
Chapter  13  was  published  as  reference  [272]. 

The author may be contacted by electronic mail at f.harrell@vanderbilt.edu
and would appreciate being informed of unclear points, errors,
and omissions in this book. Suggestions for improvements and for future
topics are also welcome. As described on the Web site, instructors may contact
the author to obtain copies of quizzes and extra assignments (both with
answers) related to much of the material in the earlier chapters, and to obtain
full solutions (with graphical output) to the majority of assignments in the
text.

Major  changes  since  the  first  edition  include  the  following: 

1. Creation of a now mature R package, rms, that replaces and greatly
extends the Design library used in the first edition

2.  Conversion  of  all  of  the  book’s  code  to  R 

3.  Conversion  of  the  book  source  into  knitr  [677]  reproducible  documents 

4.  All  code  from  the  text  is  executable  and  is  on  the  web  site 

5.  Use  of  color  graphics  and  use  of  the  ggplot2  graphics  package  [667] 

6. Scanned images were re-drawn

7. New text about problems with dichotomization of continuous variables
and with classification (as opposed to prediction)

8. Expanded material on multiple imputation and predictive mean
matching and emphasis on multiple imputation (using the Hmisc aregImpute
function) instead of single imputation

9.  Addition  of  redundancy  analysis 

10.  Added  a  new  section  in  Chapter  5  on  bootstrap  confidence  intervals  for 
rankings  of  predictors 

11.  Replacement  of  the  U.S.  presidential  election  data  with  analyses  of  a  new 
diabetes  dataset  from  NHANES  using  ordinal  and  quantile  regression 

12. More emphasis on semiparametric ordinal regression models for
continuous Y, as direct competitors of ordinary multiple regression, with a
detailed case study

13.  A  new  chapter  on  generalized  least  squares  for  analysis  of  serial  response 
data 

14. The case study in imputation and data reduction was completely reworked
and now focuses only on data reduction, with the addition of sparse
principal components

15.  More  information  about  indexes  of  predictive  accuracy 

16.  Augmentation  of  the  chapter  on  maximum  likelihood  to  include  more 
flexible  ways  of  testing  contrasts  as  well  as  new  methods  for  obtaining 
simultaneous  confidence  intervals 

17.  Binary  logistic  regression  case  study  1  was  completely  re-worked,  now 
providing  examples  of  model  selection  and  model  approximation  accuracy 

18.  Single  imputation  was  dropped  from  binary  logistic  case  study  2 

19. The case study in transform-both-sides regression modeling has been
reworked using simulated data where true transformations are known, and
a new example of the smearing estimator was added

20. Addition of 225 references, most of them published 2001–2014

21.  New  guidance  on  minimum  sample  sizes  needed  by  some  of  the  models 

22. De-emphasis of bootstrap bumping [610] for obtaining simultaneous
confidence regions, in favor of a general multiplicity approach [307].

Chapter 1

Introduction 


1.1  Hypothesis  Testing,  Estimation,  and  Prediction 

Statistics comprises, among other areas, study design, hypothesis testing,
estimation, and prediction. This text aims at the last area, by presenting
methods that enable an analyst to develop models that will make accurate
predictions of responses for future observations. Prediction could be
considered a superset of hypothesis testing and estimation, so the methods presented
here  will  also  assist  the  analyst  in  those  areas.  It  is  worth  pausing  to  explain 
how  this  is  so. 

In traditional hypothesis testing one often chooses a null hypothesis
defined as the absence of some effect. For example, in testing whether a
variable such as cholesterol is a risk factor for sudden death, one might test the
null hypothesis that an increase in cholesterol does not increase the risk of
death. Hypothesis testing can easily be done within the context of a statistical
model, but a model is not required. When one only wishes to assess whether
an effect is zero, P-values may be computed using permutation or rank
(nonparametric) tests while making only minimal assumptions. But there are still
reasons  for  preferring  a  model-based  approach  over  techniques  that  only  yield 
P-values. 

1. Permutation and rank tests do not easily give rise to estimates of
magnitudes of effects.

2.  These  tests  cannot  be  readily  extended  to  incorporate  complexities  such 
as  cluster  sampling  or  repeated  measurements  within  subjects. 

3. Once the analyst is familiar with a model, that model may be used to carry
out many different statistical tests; there is no need to learn specific
formulas to handle the special cases. The two-sample t-test is a special case
of the ordinary multiple regression model having as its sole X variable
a dummy variable indicating group membership. The Wilcoxon-Mann-Whitney
test is a special case of the proportional odds ordinal logistic
model. The analysis of variance (multiple group) test and the Kruskal-Wallis
test can easily be obtained from these two regression models by
using more than one dummy predictor variable.

© Springer International Publishing Switzerland 2015
F.E. Harrell, Jr., Regression Modeling Strategies, Springer Series
in Statistics, DOI 10.1007/978-3-319-19425-7_1
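
The statement in item 3 that the two-sample t-test is a special case of
regression on a group dummy variable can be checked numerically. The
following is a minimal sketch in Python (chosen only so the example needs no
packages; the analyses in this text use R). The data values and the helper
names two_sample_t and dummy_regression_t are invented for illustration, and
the equal-variance (pooled) form of the t-test is assumed.

```python
import math

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def dummy_regression_t(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b1, t statistic for b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1, b1 / se_b1

# Hypothetical responses for two groups (illustrative values only)
group_a = [4.1, 5.0, 3.8, 4.6, 5.2]
group_b = [5.9, 6.3, 5.1, 6.8, 5.5, 6.0]

# Dummy variable: 0 for group A, 1 for group B
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_classic = two_sample_t(group_a, group_b)
slope, t_regression = dummy_regression_t(x, y)
mean_difference = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)

print("t from two-sample formula:", round(t_classic, 6))
print("t for dummy-variable slope:", round(t_regression, 6))
```

The slope equals the difference in group means, and its t statistic matches
the classical formula up to rounding error, because the residual sum of
squares from the dummy-variable fit is the pooled within-group sum of squares.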

Even  without  complexities  such  as  repeated  measurements,  problems  can 
arise  when  many  hypotheses  are  to  be  tested.  Testing  too  many  hypotheses 
is  related  to  fitting  too  many  predictors  in  a  regression  model.  One  commonly 
hears  the  statement  that  “the  dataset  was  too  small  to  allow  modeling,  so  we 
just  did  hypothesis  tests.”  It  is  unlikely  that  the  resulting  inferences  would  be 
reliable.  If  the  sample  size  is  insufficient  for  modeling  it  is  often  insufficient 
for tests or estimation. This is especially true when one desires to publish
an estimate of the effect corresponding to the hypothesis yielding the
smallest P-value. Ordinary point estimates are known to be badly biased when
the  quantity  to  be  estimated  was  determined  by  “data  dredging.”  This  can 
be  remedied  by  the  same  kind  of  shrinkage  used  in  multivariable  modeling 
(Section  9.10). 

Statistical estimation is usually model-based. For example, one might use a
survival regression model to estimate the relative effect of increasing
cholesterol from 200 to 250 mg/dl on the hazard of death. Variables other than
cholesterol may also be in the regression model, to allow estimation of the
effect of increasing cholesterol, holding other risk factors constant. But
accurate estimation of the cholesterol effect will depend on how cholesterol as
well as each of the adjustment variables is assumed to relate to the hazard
of death. If linear relationships are incorrectly assumed, estimates will be
inaccurate. Accurate estimation also depends on avoiding overfitting the
adjustment variables. If the dataset contains 200 subjects, 30 of whom died, and
if one adjusted for 15 “confounding” variables, the estimates would be
“overadjusted” for the effects of the 15 variables, as some of their apparent effects
would actually result from spurious associations with the response variable
(time until death). The overadjustment would reduce the cholesterol effect.
The resulting unreliability of estimates equals the degree to which the overall
model fails to validate on an independent sample.

It  is  often  useful  to  think  of  effect  estimates  as  differences  between  two 
predicted  values  from  a  model.  This  way,  one  can  account  for  nonlinearities 
and interactions. For example, if cholesterol is represented nonlinearly in a
logistic  regression  model,  predicted  values  on  the  “linear  combination  of  X’s 
scale”  are  predicted  log  odds  of  an  event.  The  increase  in  log  odds  from  raising 
cholesterol  from  200  to  250  mg/dl  is  the  difference  in  predicted  values,  where 
cholesterol  is  set  to  250  and  then  to  200,  and  all  other  variables  are  held 
constant.  The  point  estimate  of  the  250:200  mg/dl  odds  ratio  is  the  anti-log 
of  this  difference.  If  cholesterol  is  represented  nonlinearly  in  the  model,  it 
does  not  matter  how  many  terms  in  the  model  involve  cholesterol  as  long  as 
the  overall  predicted  values  are  obtained. 
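
The calculation described above can be sketched in a few lines. The logistic
model below is entirely hypothetical: the function linear_predictor and its
coefficients (including the quadratic cholesterol term and the age term) are
invented for illustration and do not come from any fitted model in this text.
Python is used only to keep the sketch free of package dependencies.

```python
import math

def linear_predictor(cholesterol, age):
    """Predicted log odds from a hypothetical logistic model in which
    cholesterol enters through both a linear and a quadratic term."""
    return (-8.0
            + 0.030 * cholesterol
            - 0.00004 * cholesterol ** 2
            + 0.05 * age)

# Hold the other variable (age) constant and change only cholesterol
log_odds_250 = linear_predictor(250, age=60)
log_odds_200 = linear_predictor(200, age=60)

# Effect estimate = difference between two predicted values
log_odds_ratio = log_odds_250 - log_odds_200

# The 250:200 mg/dl odds ratio is the anti-log of that difference
odds_ratio = math.exp(log_odds_ratio)

# With no interaction in this model, the choice of age does not matter
odds_ratio_age_40 = math.exp(linear_predictor(250, age=40)
                             - linear_predictor(200, age=40))

print("log odds ratio:", round(log_odds_ratio, 4))
print("odds ratio:", round(odds_ratio, 4))
```

Both cholesterol terms contribute to the difference without any special
handling, which is the point of working with overall predicted values rather
than individual coefficients.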

Thus when one develops a reasonable multivariable predictive model,
hypothesis testing and estimation of effects are byproducts of the fitted model.
So  predictive  modeling  is  often  desirable  even  when  prediction  is  not  the  main 
goal. 


1.2 Examples of Uses of Predictive Multivariable Modeling

There is an endless variety of uses for multivariable models. Predictive
models have long been used in business to forecast financial performance and
to model consumer purchasing and loan pay-back behavior. In ecology,
regression models are used to predict the probability that a fish species will
disappear from a lake. Survival models have been used to predict product
life (e.g., time to burn-out of a mechanical part, time until saturation of a
disposable  diaper).  Models  are  commonly  used  in  discrimination  litigation  in 
an  attempt  to  determine  whether  race  or  sex  is  used  as  the  basis  for  hiring 
or  promotion,  after  taking  other  personnel  characteristics  into  account. 

Multivariable models are used extensively in medicine, epidemiology,
biostatistics, health services research, pharmaceutical research, and related
fields. The author has worked primarily in these fields, so most of the
examples in this text come from those areas. In medicine, two of the major
areas of application are diagnosis and prognosis. There, models are used to
predict  the  probability  that  a  certain  type  of  patient  will  be  shown  to  have  a 
specific  disease,  or  to  predict  the  time  course  of  an  already  diagnosed  disease. 
In  observational  studies  in  which  one  desires  to  compare  patient  outcomes 
between  two  or  more  treatments,  multivariable  modeling  is  very  important 
because  of  the  biases  caused  by  nonrandom  treatment  assignment.  Here  the 
simultaneous  effects  of  several  uncontrolled  variables  must  be  controlled  (held 
constant  mathematically  if  using  a  regression  model)  so  that  the  effect  of  the 
factor of interest can be more purely estimated. A newer technique for more
aggressively adjusting for nonrandom treatment assignment, the propensity
score [116, 530], provides yet another opportunity for multivariable modeling (see
Section 10.1.4). The propensity score is merely the predicted value from a
multivariable model where the response variable is the exposure or the
treatment actually used. The estimated propensity score is then used in a second
step  as  an  adjustment  variable  in  the  model  for  the  response  of  interest. 
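
The first step of the propensity score approach can be illustrated with a
small simulation. This is a minimal sketch under stated assumptions: the data
are simulated, there is a single hypothetical covariate called severity, and
the helper fit_logistic is a bare-bones Newton-Raphson fit written for this
example rather than a function from any package used in the text.

```python
import math
import random

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of logit P(y=1) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [expit(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log likelihood)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated observational data: sicker patients are more likely to be treated
random.seed(1)
n = 500
severity = [random.gauss(0.0, 1.0) for _ in range(n)]
treated = [1 if random.random() < expit(-0.5 + 1.0 * s) else 0
           for s in severity]

# Step 1: model the treatment actually received, not the outcome
b0, b1 = fit_logistic(severity, treated)

# The propensity score is the predicted probability of treatment
propensity = [expit(b0 + b1 * s) for s in severity]

# Step 2 (not shown): include `propensity` as an adjustment variable
# in the model for the response of interest.
print("estimated slope for severity:", round(b1, 3))
```

In practice the propensity model would contain many covariates; one covariate
is used here only to keep the fitting code short.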

It is not widely recognized that multivariable modeling is extremely
valuable even in well-designed randomized experiments. Such studies are often
designed to make relative comparisons of two or more treatments, using odds
ratios, hazard ratios, and other measures of relative effects. But to be able
to estimate absolute effects one must develop a multivariable model of the
response variable. This model can predict, for example, the probability that a
patient on treatment A with characteristics X will survive five years, or it can
predict the life expectancy for this patient. By making the same prediction
for a patient on treatment B with the same characteristics, one can estimate
the absolute difference in probabilities or life expectancies. This approach
recognizes that low-risk patients must have less absolute benefit of treatment
(lower change in outcome probability) than high-risk patients [351], a fact that
has been ignored in many clinical trials. Another reason for multivariable
modeling in randomized clinical trials is that when the basic response model
is nonlinear (e.g., logistic, Cox, parametric survival models), the unadjusted
estimate of the treatment effect is not correct if there is moderate
heterogeneity of subjects, even with perfect balance of baseline characteristics across
the treatment groups [9, 24, 198, 588] (see footnote a). So even when investigators are interested
in  simple  comparisons  of  two  groups’  responses,  multivariable  modeling  can 
be  advantageous  and  sometimes  mandatory. 
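
Both points in the preceding paragraph can be seen with simple arithmetic.
The sketch below assumes a treatment with the same odds ratio (0.5) for every
patient and two equally sized, perfectly balanced risk groups with untreated
risks of 0.05 and 0.50. These numbers and the helper names are hypothetical
and chosen only for illustration.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def risk(o):
    """Convert odds to a probability."""
    return o / (1.0 + o)

def treated_risk(control_risk, odds_ratio):
    """Risk under treatment when treatment multiplies the odds by odds_ratio."""
    return risk(odds(control_risk) * odds_ratio)

conditional_or = 0.5            # same relative effect for every patient
control_risks = [0.05, 0.50]    # low-risk and high-risk patients

treated_risks = [treated_risk(p, conditional_or) for p in control_risks]

# Absolute benefit (risk reduction) depends strongly on baseline risk
absolute_reductions = [c - t for c, t in zip(control_risks, treated_risks)]

# Unadjusted analysis: pool the two equally sized risk groups within each arm
pooled_control = sum(control_risks) / len(control_risks)
pooled_treated = sum(treated_risks) / len(treated_risks)
unadjusted_or = odds(pooled_treated) / odds(pooled_control)

print("absolute risk reductions:", [round(a, 3) for a in absolute_reductions])
print("conditional OR:", conditional_or,
      " unadjusted OR:", round(unadjusted_or, 3))
```

The high-risk patients gain a much larger absolute risk reduction than the
low-risk patients although the odds ratio is identical, and the unadjusted
odds ratio is closer to 1 than the odds ratio that applies to every
individual, even though the risk groups are perfectly balanced across arms.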

Cost-effectiveness analysis is becoming increasingly used in health care
research, and the “effectiveness” (denominator of the cost-effectiveness ratio)
is always a measure of absolute effectiveness. As absolute effectiveness varies
dramatically with the risk profiles of subjects, it must be estimated for
individual subjects using a multivariable model [90, 344].


1.3  Prediction  vs.  Classification 

For  problems  ranging  from  bioinformatics  to  marketing,  many  analysts  desire 
to  develop  “classifiers”  instead  of  developing  predictive  models.  Consider  an 
optimum  case  for  classifier  development,  in  which  the  response  variable  is 
binary,  the  two  levels  represent  a  sharp  dichotomy  with  no  gray  zone  (e.g., 
complete  success  vs.  total  failure  with  no  possibility  of  a  partial  success),  the 
user  of  the  classifier  is  forced  to  make  one  of  the  two  choices,  the  cost  of 
misclassification  is  the  same  for  every  future  observation,  and  the  ratio  of  the 
cost  of  a  false  positive  to  that  of  a  false  negative  equals  the  (often  hidden) 
ratio implied by the analyst’s classification rule. Even if all of those
conditions are met, classification is still inferior to probability modeling for driving
the  development  of  a  predictive  instrument  or  for  estimation  or  hypothesis 
testing.  It  is  far  better  to  use  the  full  information  in  the  data  to  develop  a 
probability  model,  then  develop  classification  rules  on  the  basis  of  estimated 
probabilities.  At  the  least,  this  forces  the  analyst  to  use  a  proper  accuracy 
score2  in  finding  or  weighting  data  features. 
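
To illustrate why a proper accuracy score is more informative than a
classification-based one, the sketch below compares two hypothetical sets of
predicted probabilities for the same six outcomes. The Brier score (mean
squared error of the predicted probabilities) is used here as one example of
a proper score; that choice, the data, and the function names are assumptions
made for this illustration.

```python
def classification_accuracy(probabilities, outcomes, cutoff=0.5):
    """Proportion classified correctly when predicting 1 if probability > cutoff."""
    correct = sum((p > cutoff) == bool(y)
                  for p, y in zip(probabilities, outcomes))
    return correct / len(outcomes)

def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2
               for p, y in zip(probabilities, outcomes)) / len(outcomes)

outcomes = [1, 1, 1, 0, 0, 0]

# Two hypothetical models: one discriminates sharply, one barely at all
sharp_model = [0.90, 0.80, 0.90, 0.10, 0.20, 0.10]
vague_model = [0.55, 0.60, 0.51, 0.45, 0.40, 0.49]

accuracy_sharp = classification_accuracy(sharp_model, outcomes)
accuracy_vague = classification_accuracy(vague_model, outcomes)
brier_sharp = brier_score(sharp_model, outcomes)
brier_vague = brier_score(vague_model, outcomes)

print("accuracy:", accuracy_sharp, accuracy_vague)
print("Brier score:", round(brier_sharp, 4), round(brier_vague, 4))
```

Both models classify every observation correctly at a cutoff of 0.5, so
classification accuracy cannot tell them apart, while the Brier score (lower
is better) clearly favors the model whose probabilities are closer to the
observed outcomes.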

a For example, unadjusted odds ratios from 2 x 2 tables are different from adjusted
odds ratios when there is variation in subjects’ risk factors within each treatment
group, even when the distribution of the risk factors is identical between the two
groups.

When the dependent variable is ordinal or continuous, classification through
forced up-front dichotomization in an attempt to simplify the problem results
in arbitrariness and major information loss even when the optimum cut point
(the median) is used. Dichotomizing the outcome at a different point may
require a many-fold increase in sample size to make up for the lost information.
In the area of medical diagnosis, it is often the case that the disease
is really on a continuum, and predicting the severity of disease (rather than
just its presence or absence) will greatly increase power and precision, not to
mention making the result less arbitrary.
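
The information loss from dichotomizing at the median can be illustrated with a small simulation. This is only a sketch under assumed conditions that are not in the text: a single predictor, a bivariate normal relationship with a hypothetical correlation of 0.5, and an arbitrary sample size and seed. Under bivariate normality a median split is expected to shrink the correlation by a factor of about sqrt(2/pi), roughly 0.80.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation computed from first principles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=0.5, seed=1):
    """Correlation of X with continuous Y and with Y split at its median.

    n, rho, and seed are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    median_y = sorted(y)[n // 2]
    y_binary = [1.0 if yi > median_y else 0.0 for yi in y]
    return pearson(x, y), pearson(x, y_binary)


r_continuous, r_dichotomized = simulate()
print(f"continuous Y: r = {r_continuous:.3f}")
print(f"median split: r = {r_dichotomized:.3f}")
print(f"ratio       : {r_dichotomized / r_continuous:.3f}")
```

Cut points other than the median would be expected to attenuate the association further, which is consistent with the sample size penalty described above.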

It  is  important  to  note  that  two-group  classification  represents  an  artificial 
forced  choice.  It  is  not  often  the  case  that  the  user  of  the  classifier  needs  to 
be  limited  to  two  possible  actions.  The  best  option  for  many  subjects  may 
be  to  refuse  to  make  a  decision  or  to  obtain  more  data  (e.g.,  order  another 
medical  diagnostic  test).  A  gray  zone  can  be  helpful,  and  predictions  include 
gray  zones  automatically. 

Unlike  prediction  (e.g.,  of  absolute  risk),  classification  implicitly  uses  util¬ 
ity  functions  (also  called  loss  or  cost  functions,  e.g.,  cost  of  a  false  positive 
classification).  Implicit  utility  functions  are  highly  problematic.  First,  it  is 
well known that the utility function depends on variables that are not predictive
of outcome and are not collected (e.g., subjects’ preferences), and that
are available only at the decision point. Second, the approach assumes every
subject  has  the  same  utility  function b.  Third,  the  analyst  presumptuously 
assumes  that  the  subject’s  utility  coincides  with  his  own. 

Formal decision analysis uses subject-specific utilities and optimum predictions
based on all available data [62, 74, 183, 210, 219, 642] c. It follows that receiver
operating characteristic curve (ROC d) analysis is misleading except for the
special case of mass one-time group decision making with unknown utilities
(e.g., launching a flu vaccination program).


b Simple examples to the contrary are the lesser weight given to a false negative
diagnosis of cancer in the elderly and the aversion of some subjects to surgery or
chemotherapy.

c To make an optimal decision you need to know all relevant data about an individual
(used to estimate the probability of an outcome), and the utility (cost, loss function)
of making each decision. Sensitivity and specificity do not provide this information.
For example, if one estimated that the probability of a disease given age, sex, and
symptoms is 0.1 and the “cost” of a false positive equaled the “cost” of a false negative,
one would act as if the person does not have the disease. Given other utilities, one
would make different decisions. If the utilities are unknown, one gives the best estimate
of the probability of the outcome to the decision maker and lets her incorporate her
own unspoken utilities in making an optimum decision for her.

Besides the fact that cutoffs that are not individualized do not apply to individuals,
only to groups, individual decision making does not utilize sensitivity and specificity.
For an individual we can compute $\mathrm{Prob}(Y = 1 | X = x)$; we don’t care about
$\mathrm{Prob}(Y = 1 | X > c)$, and an individual having $X = x$ would be quite puzzled
if she were given $\mathrm{Prob}(X > c |$ future unknown $Y)$ when she already knows
$X = x$ so $X$ is no longer a random variable.

Even when group decision making is needed, sensitivity and specificity can be
bypassed. For mass marketing, for example, one can rank order individuals by the
estimated probability of buying the product, to create a lift curve. This is then used
to target the k most likely buyers where k is chosen to meet total program cost
constraints.
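
The reasoning in footnote c can be sketched in a few lines of code. The function names and the cost values below are hypothetical, and the sketch assumes that correct decisions carry zero cost, so only the two misclassification costs matter. It shows that the same estimated probability of 0.1 leads to different actions under different utilities, and that group targeting can rank subjects by probability without any cutoff on sensitivity or specificity.

```python
def action_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has the lower expected cost.

    Assumes correct decisions cost nothing (an illustrative simplification).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(prob_disease, cost_false_positive, cost_false_negative):
    """Return True to act as if the disease is present, False otherwise."""
    expected_cost_if_acting = (1 - prob_disease) * cost_false_positive
    expected_cost_if_not_acting = prob_disease * cost_false_negative
    return expected_cost_if_acting < expected_cost_if_not_acting


def target_top_k(probabilities, k):
    """Indices of the k subjects with the highest estimated probabilities."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]


# Equal costs: with probability 0.1, act as if the disease is absent.
print(decide(0.1, cost_false_positive=1, cost_false_negative=1))
# A false negative assumed 20 times as costly: the same probability now favors acting.
print(decide(0.1, cost_false_positive=1, cost_false_negative=20))
# Group decision: pick the 2 most likely buyers from hypothetical probabilities.
print(target_top_k([0.2, 0.9, 0.5, 0.7], k=2))
```

The estimated probability is produced once by the model; the utilities enter only at the decision point, which is why they can differ from one subject to the next.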

An  analyst’s  goal  should  be  the  development  of  the  most  accurate  and 
reliable  predictive  model  or  the  best  model  on  which  to  base  estimation  or 
hypothesis  testing.  In  the  vast  majority  of  cases,  classification  is  the  task  of 
the  user  of  the  predictive  model,  at  the  point  at  which  utilities  (costs)  and
preferences  are  known. 


1.4  Planning  for  Modeling 

When  undertaking  the  development  of  a  model  to  predict  a  response,  one 
of  the  first  questions  the  researcher  must  ask  is  “will  this  model  actually  be 
used?”  Many  models  are  never  used,  for  several  reasons522  including:  (1)  it 
was  not  deemed  relevant  to  make  predictions  in  the  setting  envisioned  by 
the  authors;  (2)  potential  users  of  the  model  did  not  trust  the  relationships, 
weights,  or  variables  used  to  make  the  predictions;  and  (3)  the  variables 
necessary  to  make  the  predictions  were  not  routinely  available. 

Once  the  researcher  convinces  herself  that  a  predictive  model  is  worth 
developing,  there  are  many  study  design  issues  to  be  addressed.18,378  Models 
are  often  developed  using  a  “convenience  sample,”  that  is,  a  dataset  that  was 
not  collected  with  such  predictions  in  mind.  The  resulting  models  are  often 
fraught  with  difficulties  such  as  the  following. 

1.  The  most  important  predictor  or  response  variables  may  not  have  been 
collected,  tempting  the  researchers  to  make  do  with  variables  that  do  not 
capture  the  real  underlying  processes. 

2.  The  subjects  appearing  in  the  dataset  are  ill-defined,  or  they  are  not  repre¬ 
sentative  of  the  population  for  which  inferences  are  to  be  drawn;  similarly, 
the  data  collection  sites  may  not  represent  the  kind  of  variation  in  the 
population  of  sites. 

3.  Key  variables  are  missing  in  large  numbers  of  subjects. 

4.  Data  are  not  missing  at  random;  for  example,  data  may  not  have  been 
collected  on  subjects  who  dropped  out  of  a  study  early,  or  on  patients  who 
were  too  sick  to  be  interviewed. 

5.  Operational  definitions  of  some  of  the  key  variables  were  never  made. 

6.  Observer  variability  studies  may  not  have  been  done,  so  that  the  relia¬ 
bility  of  measurements  is  unknown,  or  there  are  other  kinds  of  important 
measurement  errors. 

A predictive model will be more accurate, as well as useful, when data collection
is planned prospectively. That way one can design data collection
instruments containing the necessary variables, and all terms can be given
standard definitions (for both descriptive and response variables) for use at
all data collection sites. Also, steps can be taken to minimize the amount of
missing data.

d The ROC curve is a plot of sensitivity vs. one minus specificity as one varies a
cutoff on a continuous predictor used to make a decision.

In  the  context  of  describing  and  modeling  health  outcomes,  Iezzoni  has 
an  excellent  discussion  of  the  dimensions  of  risk  that  should  be  captured  by 
variables  included  in  the  model.  She  lists  these  general  areas  that  should  be 
quantified  by  predictor  variables: 

1.  age, 

2.  sex, 

3.  acute  clinical  stability, 

4.  principal  diagnosis, 

5.  severity  of  principal  diagnosis, 

6.  extent  and  severity  of  comorbidities, 

7.  physical  functional  status, 

8.  psychological,  cognitive,  and  psychosocial  functioning, 

9.  cultural,  ethnic,  and  socioeconomic  attributes  and  behaviors, 

10.  health  status  and  quality  of  life,  and 

11.  patient  attitudes  and  preferences  for  outcomes. 


Some  baseline  covariates  to  be  sure  to  capture  in  general  include 

1.  a  baseline  measurement  of  the  response  variable, 

2.  the  subject’s  most  recent  status, 

3.  the  subject’s  trajectory  as  of  time  zero  or  past  levels  of  a  key  variable, 

4.  variables  explaining  much  of  the  variation  in  the  response,  and 

5.  more  subtle  predictors  whose  distributions  strongly  differ  between  the 
levels  of  a  key  variable  of  interest  in  an  observational  study. 

Many  things  can  go  wrong  in  statistical  modeling,  including  the  following. 

1.  The  process  generating  the  data  is  not  stable. 

2.  The  model  is  misspecified  with  regard  to  nonlinearities  or  interactions,  or 
there  are  predictors  missing. 

3.  The  model  is  misspecified  in  terms  of  the  transformation  of  the  response 
variable  or  the  model’s  distributional  assumptions. 

4.  The  model  contains  discontinuities  (e.g.,  by  categorizing  continuous  predic¬ 
tors  or  fitting  regression  shapes  with  sudden  changes)  that  can  be  gamed 
by  users. 

5.  Correlations  among  subjects  are  not  specified,  or  the  correlation  structure 
is  misspecified,  resulting  in  inefficient  parameter  estimates  and  overconfi¬ 
dent  inference. 

6.  The  model  is  overfitted,  resulting  in  predictions  that  are  too  extreme  or 
positive  associations  that  are  false.

7.  The  user  of  the  model  relies  on  predictions  obtained  by  extrapolating  to
combinations  of  predictor  values  well  outside  the  range  of  the  dataset  used 
to  develop  the  model. 

8.  Accurate  and  discriminating  predictions  can  lead  to  behavior  changes  that 
make  future  predictions  inaccurate. 


1.4.1  Emphasizing  Continuous  Variables

When  designing  the  data  collection  it  is  important  to  emphasize  the  use  of 
continuous  variables  over  categorical  ones.  Some  categorical  variables  are  sub¬ 
jective  and  hard  to  standardize,  and  on  the  average  they  do  not  contain  the 
same  amount  of  statistical  information  as  continuous  variables.  Above  all,  it 
is  unwise  to  categorize  naturally  continuous  variables  during  data  collection e,
as  the  original  values  can  then  not  be  recovered,  and  if  another  researcher 
feels  that  the  (arbitrary)  cutoff  values  were  incorrect,  other  cutoffs  cannot 
be  substituted.  Many  researchers  make  the  mistake  of  assuming  that  catego¬ 
rizing  a  continuous  variable  will  result  in  less  measurement  error.  This  is  a 
false  assumption,  for  if  a  subject  is  placed  in  the  wrong  interval  this  will  be 
as  much  as  a  100%  error.  Thus  the  magnitude  of  the  error  multiplied  by  the 
probability  of  an  error  is  no  better  with  categorization. 


1.5  Choice  of  the  Model 


The  actual  method  by  which  an  underlying  statistical  model  should  be  chosen 
by  the  analyst  is  not  well  developed.  A.  P.  Dawid  is  quoted  in  Lehmann 
as  saying  the  following. 

Where  do  probability  models  come  from?  To  judge  by  the  resounding  silence 
over  this  question  on  the  part  of  most  statisticians,  it  seems  highly  embarrass¬ 
ing.  In  general,  the  theoretician  is  happy  to  accept  that  his  abstract  probability 
triple ($\Omega$, A, P) was found under a gooseberry bush, while the applied
statistician’s model “just growed”.

In biostatistics, epidemiology, economics, psychology, sociology, and many
other fields it is seldom the case that subject matter knowledge exists that
would allow the analyst to pre-specify a model (e.g., Weibull or log-normal
survival model), a transformation for the response variable, and a structure
for how predictors appear in the model (e.g., transformations, addition of
nonlinear terms, interaction terms). Indeed, some authors question whether
the notion of a true model even exists in many cases.100 We are for better
or worse forced to develop models empirically in the majority of cases.
Fortunately, careful and objective validation of the accuracy of model
predictions against observable responses can lend credence to a model, if a good
validation is not merely the result of overfitting (see Section 5.3).

e An exception may be sensitive variables such as income level. Subjects may be more
willing to check a box corresponding to a wide interval containing their income. It
is unlikely that a reduction in the probability that a subject will inflate her income
will offset the loss of precision due to categorization of income, but there will be a
decrease in the number of refusals. This reduction in missing data can more than
offset the lack of precision.

There  are  a  few  general  guidelines  that  can  help  in  choosing  the  basic  form 
of  the  statistical  model. 

1.  The  model  must  use  the  data  efficiently.  If,  for  example,  one  were  inter¬ 
ested  in  predicting  the  probability  that  a  patient  with  a  specific  set  of 
characteristics  would  live  five  years  from  diagnosis,  an  inefficient  model 
would  be  a  binary  logistic  model.  A  more  efficient  method,  and  one  that 
would  also  allow  for  losses  to  follow-up  before  five  years,  would  be  a  semi- 
parametric  (rank  based)  or  parametric  survival  model.  Such  a  model  uses 
individual  times  of  events  in  estimating  coefficients,  but  it  can  easily  be 
used  to  estimate  the  probability  of  surviving  five  years.  As  another  exam¬ 
ple,  if  one  were  interested  in  predicting  patients’  quality  of  life  on  a  scale 
of  excellent,  very  good,  good,  fair,  and  poor,  a  polytomous  (multinomial) 
categorical  response  model  would  not  be  efficient  as  it  would  not  make  use 
of  the  ordering  of  responses. 

2.  Choose  a  model  that  fits  overall  structures  likely  to  be  present  in  the 
data.  In  modeling  survival  time  in  chronic  disease  one  might  feel  that  the 
importance  of  most  of  the  risk  factors  is  constant  over  time.  In  that  case, 
a  proportional  hazards  model  such  as  the  Cox  or  Weibull  model  would 
be  a  good  initial  choice.  If  on  the  other  hand  one  were  studying  acutely 
ill  patients  whose  risk  factors  wane  in  importance  as  the  patients  survive 
longer,  a  model  such  as  the  log-normal  or  log-logistic  regression  model 
would  be  more  appropriate. 

3.  Choose  a  model  that  is  robust  to  problems  in  the  data  that  are  difficult  to 
check.  For  example,  the  Cox  proportional  hazards  model  and  ordinal  logis¬ 
tic  models  are  not  affected  by  monotonic  transformations  of  the  response 
variable. 

4.  Choose  a  model  whose  mathematical  form  is  appropriate  for  the  response 
being  modeled.  This  often  has  to  do  with  minimizing  the  need  for  in¬ 
teraction  terms  that  are  included  only  to  address  a  basic  lack  of  fit.  For 
example,  many  researchers  have  used  ordinary  linear  regression  models 
for  binary  responses,  because  of  their  simplicity.  But  such  models  allow 
predicted  probabilities  to  be  outside  the  interval  [0,1],  and  strange  in¬ 
teractions  among  the  predictor  variables  are  needed  to  make  predictions 
remain  in  the  legal  range. 

5. Choose a model that is readily extendible. The Cox model, by its use of
stratification, easily allows a few of the predictors, especially if they are
categorical, to violate the assumption of equal regression coefficients over
time (proportional hazards assumption). The continuation ratio ordinal
logistic model can also be generalized easily to allow for varying coefficients
of some of the predictors as one proceeds across categories of the response.

R.  A.  Fisher  as  quoted  in  Lehmann  had  these  suggestions  about  model 
building:  “(a)  We  must  confine  ourselves  to  those  forms  which  we  know  how 
to  handle,”  and  (b)  “More  or  less  elaborate  forms  will  be  suitable  according 
to  the  volume  of  the  data.”  Ameen  [100,  p.  453]  stated  that  a  good  model  is 
“(a)  satisfactory  in  performance  relative  to  the  stated  objective,  (b)  logically 
sound,  (c)  representative,  (d)  questionable  and  subject  to  on-line  interroga¬ 
tion,  (e)  able  to  accommodate  external  or  expert  information  and  (f)  able  to 
convey  information.” 

It  is  very  typical  to  use  the  data  to  make  decisions  about  the  form  of 
the  model  as  well  as  about  how  predictors  are  represented  in  the  model. 
Then,  once  a  model  is  developed,  the  entire  modeling  process  is  routinely 
forgotten,  and  statistical  quantities  such  as  standard  errors,  confidence  limits, 
P-values, and $R^2$ are computed as if the resulting model were entirely
pre-specified. However, Faraway,186 Draper,163 Chatfield,100 Buckland et al.80
and  others  have  written  about  the  severe  problems  that  result  from  treating 
an  empirically  derived  model  as  if  it  were  pre-specified  and  as  if  it  were  the 
correct  model.  As  Chatfield  states  [100,  p.  426]:  “It  is  indeed  strange  that  we
often  admit  model  uncertainty  by  searching  for  a  best  model  but  then  ignore 
this  uncertainty  by  making  inferences  and  predictions  as  if  certain  that  the 
best  fitting  model  is  actually  true.” 

Stepwise  variable  selection  is  one  of  the  most  widely  used  and  abused  of 
all  data  analysis  techniques.  Much  is  said  about  this  technique  later  (see  Sec¬ 
tion  4.3),  but  there  are  many  other  elements  of  model  development  that  will 
need  to  be  accounted  for  when  making  statistical  inferences,  and  unfortu¬ 
nately  it  is  difficult  to  derive  quantities  such  as  confidence  limits  that  are 
properly  adjusted  for  uncertainties  such  as  the  data-based  choice  between  a 
Weibull  and  a  log-normal  regression  model. 

Ye6^8  developed  a  general  method  for  estimating  the  “generalized  degrees 
of  freedom”  (GDF)  for  any  “data  mining”  or  model  selection  procedure  based 
on  least  squares.  The  GDF  is  an  extremely  useful  index  of  the  amount  of 
“data  dredging”  or  overfitting  that  has  been  done  in  a  modeling  process. 
It  is  also  useful  for  estimating  the  residual  variance  with  less  bias.  In  one 
example,  Ye  developed  a  regression  tree  using  recursive  partitioning  involving 
10  candidate  predictor  variables  on  100  observations.  The  resulting  tree  had 
19 nodes and GDF of 76. The usual way of estimating the residual variance
involves dividing the pooled within-node sum of squares by 100 - 19, but Ye
showed that dividing by 100 - 76 instead yielded a much less biased (and
much higher) estimate of $\sigma^2$. In another example, Ye considered stepwise
variable  selection  using  20  candidate  predictors  and  22  observations.  When 
there  is  no  true  association  between  any  of  the  predictors  and  the  response, 
Ye  found  that  GDF  =  14.1  for  a  strategy  that  selected  the  best  five-variable
model. 
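
The arithmetic behind the regression tree example can be made explicit. The sketch below uses only the numbers quoted above (100 observations, 19 nodes, GDF of 76); the sum of squares is a hypothetical placeholder, since the text does not report it, and it cancels out of the ratio.

```python
def residual_variance(sum_of_squares, n_obs, degrees_of_freedom):
    """Residual variance estimate: sum of squares over residual degrees of freedom."""
    return sum_of_squares / (n_obs - degrees_of_freedom)


n_obs, n_nodes, gdf = 100, 19, 76
pooled_ss = 1.0  # hypothetical value; the actual sum of squares is not given in the text

naive = residual_variance(pooled_ss, n_obs, n_nodes)   # divides by 100 - 19 = 81
adjusted = residual_variance(pooled_ss, n_obs, gdf)    # divides by 100 - 76 = 24

# The GDF-based estimate is 81 / 24 = 3.375 times the usual one.
inflation = adjusted / naive
print(inflation)
```

Whatever the sum of squares, accounting for the data dredging raises the variance estimate by the same factor of 81/24 in this example.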


Given  that  the  choice  of  the  model  has  been  made  (e.g.,  a  log-normal
model),  penalized  maximum  likelihood  estimation  has  major  advantages  in 
the  battle  between  making  the  model  fit  adequately  and  avoiding  overfitting 
(Sections  9.10  and  13.4.7).  Penalization  lessens  the  need  for  model  selection. 


1.6  Further  Reading 


Briggs  and  Zaretzki74  eloquently  state  the  problem  with  ROC  curves  and  the
areas  under  them  (AUC): 


Statistics such as the AUC are not especially relevant to someone who
must make a decision about a particular $x_c$. ... ROC curves lack or
obscure several quantities that are necessary for evaluating the operational
effectiveness of diagnostic tests. ... ROC curves were first used to check
how radio receivers (like radar receivers) operated over a range of
frequencies. ... This is not how most ROC curves are used now, particularly
in medicine. The receiver of a diagnostic measurement ... wants to make
a decision based on some $x_c$, and is not especially interested in how well
he would have done had he used some different cutoff.


In  the  discussion  to  their  paper,  David  Hand  states 

When  integrating  to  yield  the  overall  AUC  measure,  it  is  necessary  to 
decide  what  weight  to  give  each  value  in  the  integration.  The  AUC  im¬ 
plicitly  does  this  using  a  weighting  derived  empirically  from  the  data. 
This  is  nonsensical.  The  relative  importance  of  misclassifying  a  case  as 
a  noncase,  compared  to  the  reverse,  cannot  come  from  the  data  itself.  It 
must  come  externally,  from  considerations  of  the  severity  one  attaches  to 
the different kinds of misclassifications.

AUC, only because it equals the concordance probability in the binary Y case,
is  still  often  useful  as  a  predictive  discrimination  measure. 
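
The concordance probability mentioned here can be computed directly from its definition: among all pairs consisting of one subject with Y = 1 and one with Y = 0, it is the proportion in which the subject with Y = 1 has the higher predicted value, with ties counted as one half. The following is a minimal sketch; the function name and the four data values are illustrative assumptions.

```python
def concordance_probability(predicted, outcome):
    """Proportion of (Y=1, Y=0) pairs ranked correctly; ties count 1/2."""
    cases = [p for p, y in zip(predicted, outcome) if y == 1]
    noncases = [p for p, y in zip(predicted, outcome) if y == 0]
    if not cases or not noncases:
        raise ValueError("need at least one subject with Y=1 and one with Y=0")
    score = 0.0
    for p_case in cases:
        for p_noncase in noncases:
            if p_case > p_noncase:
                score += 1.0
            elif p_case == p_noncase:
                score += 0.5
    return score / (len(cases) * len(noncases))


# Hypothetical predictions for four subjects; 3 of the 4 pairs are concordant.
c_index = concordance_probability([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(c_index)
```

No cutoff on the predictions appears anywhere in this calculation, which is why the quantity is usable as a pure discrimination measure.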

More severe problems caused by dichotomizing continuous variables are
discussed in [13, 17, 45, 82, 185, 294, 379, 521, 597].

See  the  excellent  editorial  by  Mallows434  for  more  about  model  choice.  See 
Breiman  and  discussants67  for  an  interesting  debate  about  the  use  of  data 
models  vs.  algorithms.  This  material  also  covers  interpretability  vs.  predictive 
accuracy  and  several  other  topics. 

See  [15,80,100,163,186,415]  for  information  about  accounting  for  model  selec¬ 
tion  in  making  final  inferences.  Faraway186  demonstrated  that  the  bootstrap 
has  good  potential  in  related  although  somewhat  simpler  settings,  and  Buck- 
land  et  al.80  developed  a  promising  bootstrap  weighting  method  for  accounting 
for  model  uncertainty. 

Tibshirani  and  Knight611  developed  another  approach  to  estimating  the  gener¬ 
alized  degrees  of  freedom.  Luo  et  al.430  developed  a  way  to  add  noise  of  known 
variance  to  the  response  variable  to  tune  the  stopping  rule  used  for  variable 
selection.  Zou  et  al.689  showed  that  the  lasso,  an  approach  that  simultaneously 
selects  variables  and  shrinks  coefficients,  has  a  nice  property.  Since  it  uses  pe¬ 
nalization  (shrinkage),  an  unbiased  estimate  of  its  effective  number  of  degrees 
of  freedom  is  the  number  of  nonzero  regression  coefficients  in  the  final  model. 


Chapter  2 

General  Aspects  of  Fitting 
Regression  Models 


2.1  Notation  for  Multivariable  Regression  Models 

The  ordinary  multiple  linear  regression  model  is  frequently  used  and  has 
parameters  that  are  easily  interpreted.  In  this  chapter  we  study  a  general 
class  of  regression  models,  those  stated  in  terms  of  a  weighted  sum  of  a  set 
of  independent  or  predictor  variables.  It  is  shown  that  after  linearizing  the 
model  with  respect  to  the  predictor  variables,  the  parameters  in  such  re¬ 
gression  models  are  also  readily  interpreted.  Also,  all  the  designs  used  in 
ordinary  linear  regression  can  be  used  in  this  general  setting.  These  designs 
include  analysis  of  variance  (ANOVA)  setups,  interaction  effects,  and  nonlin¬ 
ear  effects.  Besides  describing  and  interpreting  general  regression  models,  this 
chapter  also  describes,  in  general  terms,  how  the  three  types  of  assumptions 
of  regression  models  can  be  examined. 

First we introduce notation for regression models. Let $Y$ denote the
response (dependent) variable, and let $X = X_1, X_2, \ldots, X_p$ denote a list or
vector of predictor variables (also called covariables or independent,
descriptor, or concomitant variables). These predictor variables are assumed to be
constants for a given individual or subject from the population of interest.
Let $\beta = \beta_0, \beta_1, \ldots, \beta_p$ denote the list of regression coefficients (parameters).
$\beta_0$ is an optional intercept parameter, and $\beta_1, \ldots, \beta_p$ are weights or regression
coefficients corresponding to $X_1, \ldots, X_p$. We use matrix or vector notation
to describe a weighted sum of the $X$s:

$X\beta = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$,  (2.1)

where there is an implied $X_0 = 1$.
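
As a concrete reading of Equation (2.1), the weighted sum can be computed by prepending the implied constant 1 to the predictor values. The function name and the numbers are hypothetical and serve only to illustrate the notation.

```python
def linear_predictor(beta, x):
    """Weighted sum beta_0 + beta_1*x_1 + ... + beta_p*x_p.

    beta has p + 1 entries (intercept first); x has p entries.
    """
    if len(beta) != len(x) + 1:
        raise ValueError("beta must have one more entry than x (the intercept)")
    design_row = [1.0] + list(x)  # the implied X_0 = 1
    return sum(b * v for b, v in zip(beta, design_row))


# Hypothetical coefficients and predictor values: 1 + 2*3 - 0.5*4 = 5
xb = linear_predictor([1.0, 2.0, -0.5], [3.0, 4.0])
print(xb)
```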

A regression model is stated in terms of a connection between the
predictors $X$ and the response $Y$. Let $C(Y|X)$ denote a property of the distribution
of $Y$ given $X$ (as a function of $X$). For example, $C(Y|X)$ could be $E(Y|X)$,
the expected value or average of $Y$ given $X$, or $C(Y|X)$ could be the
probability that $Y = 1$ given $X$ (where $Y = 0$ or 1).


2.2  Model  Formulations 

We  define  a  regression  function  as  a  function  that  describes  interesting  prop¬ 
erties  of  Y  that  may  vary  across  individuals  in  the  population.  X  describes  the 
list  of  factors  determining  these  properties.  Stated  mathematically,  a  general 
regression  model  is  given  by 


$C(Y|X) = g(X)$.  (2.2)

We  restrict  our  attention  to  models  that,  after  a  certain  transformation,  are 
linear  in  the  unknown  parameters,  that  is,  models  that  involve  X  only  through 
a  weighted  sum  of  all  the  Xs.  The  general  linear  regression  model  is  given  by 

$C(Y|X) = g(X\beta)$.  (2.3)

For example, the ordinary linear regression model is

$C(Y|X) = E(Y|X) = X\beta$,  (2.4)

and given $X$, $Y$ has a normal distribution with mean $X\beta$ and constant
variance $\sigma^2$. The binary logistic regression model 29,64 ‘ is

$C(Y|X) = \mathrm{Prob}\{Y = 1|X\} = (1 + \exp(-X\beta))^{-1}$,  (2.5)

where $Y$ can take on the values 0 and 1. In general the model, when
stated in terms of the property $C(Y|X)$, may not be linear in $X\beta$; that
is $C(Y|X) = g(X\beta)$, where $g(u)$ is nonlinear in $u$. For example, a regression
model could be $E(Y|X) = (X\beta)^{.5}$. The model may be made linear in the
unknown parameters by a transformation in the property $C(Y|X)$:

$h(C(Y|X)) = X\beta$,  (2.6)

where $h(u) = g^{-1}(u)$, the inverse function of $g$. As an example consider the
binary logistic regression model given by

$C(Y|X) = \mathrm{Prob}\{Y = 1|X\} = (1 + \exp(-X\beta))^{-1}$.  (2.7)

If $h(u) = \mathrm{logit}(u) = \log(u/(1 - u))$, the transformed model becomes

$h(\mathrm{Prob}(Y = 1|X)) = \log(\exp(X\beta)) = X\beta$.  (2.8)
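
The relationship between Equations (2.7) and (2.8) can be checked numerically: applying the logit to the logistic probability recovers the linear predictor. This is a minimal sketch using only the standard library; the function names and the test value 1.3 are illustrative choices.

```python
import math


def inverse_logit(u):
    """g(u) = (1 + exp(-u))^(-1): maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-u))


def logit(p):
    """h(p) = log(p / (1 - p)): the inverse of g, i.e., the log odds."""
    return math.log(p / (1.0 - p))


xb = 1.3                      # a hypothetical value of the linear predictor
prob = inverse_logit(xb)      # Prob{Y = 1 | X} under the logistic model
recovered = logit(prob)       # should equal xb up to rounding error
print(prob, recovered)
```

The probability is nonlinear in the linear predictor, but its logit is exactly linear, which is what makes the coefficients interpretable on the log odds scale.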


The transformation $h(C(Y|X))$ is sometimes called a link function. Let
$h(C(Y|X))$ be denoted by $C'(Y|X)$. The general linear regression model then
becomes

$C'(Y|X) = X\beta$.  (2.9)

In other words, the model states that some property $C'$ of $Y$, given $X$, is
a weighted sum of the $X$s ($X\beta$). In the ordinary linear regression model,
$C'(Y|X) = E(Y|X)$. In the logistic regression case, $C'(Y|X)$ is the logit of
the probability that $Y = 1$, $\log \mathrm{Prob}\{Y = 1\} / [1 - \mathrm{Prob}\{Y = 1\}]$. This is the
log of the odds that $Y = 1$ versus $Y = 0$.

It is important to note that the general linear regression model has two
major components: $C'(Y|X)$ and $X\beta$. The first part has to do with a property
or transformation of $Y$. The second, $X\beta$, is the linear regression or linear
predictor part. The method of least squares can sometimes be used to fit
the model if $C'(Y|X) = E(Y|X)$. Other cases must be handled using other
methods such as maximum likelihood estimation or nonlinear least squares.


2.3  Interpreting  Model  Parameters 

In the original model, $C(Y|X)$ specifies the way in which $X$ affects a property
of $Y$. Except in the ordinary linear regression model, it is difficult to interpret
the individual parameters if the model is stated in terms of $C(Y|X)$. In the
model $C'(Y|X) = X\beta = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$, the regression parameter
$\beta_j$ is interpreted as the change in the property $C'$ of $Y$ per unit change in
the descriptor variable $X_j$, all other descriptors remaining constant a:

$\beta_j = C'(Y|X_1, X_2, \ldots, X_j + 1, \ldots, X_p) - C'(Y|X_1, X_2, \ldots, X_j, \ldots, X_p)$.  (2.10)

In the ordinary linear regression model, for example, $\beta_j$ is the change in
expected value of $Y$ per unit change in $X_j$. In the logistic regression model
$\beta_j$ is the change in log odds that $Y = 1$ per unit change in $X_j$. When a
non-interacting $X_j$ is a dichotomous variable or a continuous one that is
linearly related to $C'$, $X_j$ is represented by a single term in the model and
its contribution is described fully by $\beta_j$.
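
The interpretation of a coefficient as a change in log odds can be verified with a small numerical sketch. The coefficients below are hypothetical, and the model is assumed to be a logistic model with two non-interacting predictors. The change in log odds per unit change in the first predictor is constant, while the corresponding change in probability depends on where the other predictor is held.

```python
import math

# Hypothetical coefficients: intercept, beta_1, beta_2
BETA = (-2.0, 0.7, 0.03)


def log_odds(x1, x2):
    """Linear predictor, equal to the log odds that Y = 1 in a logistic model."""
    return BETA[0] + BETA[1] * x1 + BETA[2] * x2


def probability(x1, x2):
    """Prob{Y = 1 | X} obtained by inverting the logit."""
    return 1.0 / (1.0 + math.exp(-log_odds(x1, x2)))


def odds(p):
    return p / (1.0 - p)


# Unit change in x1 with x2 held at 50: the log odds change by beta_1.
change_in_log_odds = log_odds(1, 50) - log_odds(0, 50)
odds_ratio = odds(probability(1, 50)) / odds(probability(0, 50))

# The change in probability is not constant across values of x2.
risk_difference_low = probability(1, 0) - probability(0, 0)
risk_difference_high = probability(1, 50) - probability(0, 50)
print(change_in_log_odds, odds_ratio, risk_difference_low, risk_difference_high)
```

On the odds scale the unit change multiplies the odds by exp(beta_1) regardless of the other predictor, which is the sense in which a single coefficient fully describes the contribution.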

In all that follows, we drop the $'$ from $C'$ and assume that $C(Y|X)$ is the
property of $Y$ that is linearly related to the weighted sum of the $X$s.


a Note that it is not necessary to “hold constant” all other variables to be able to
interpret the effect of one predictor. It is sufficient to hold constant the weighted sum
of all the variables other than $X_j$. And in many cases it is not physically possible to
hold other variables constant while varying one, e.g., when a model contains $X$ and
$X^2$ (David Hoaglin, personal communication).


2.3.1  Nominal  Predictors

Suppose  that  we  wish  to  model  the  effect  of  two  or  more  treatments  and  be 
able  to  test  for  differences  between  the  treatments  in  some  property  of  Y. 
A nominal or polytomous factor such as treatment group having $k$ levels, in
which there is no definite ordering of categories, is fully described by a series of
$k - 1$ binary indicator variables (sometimes called dummy variables). Suppose
that there are four treatments, $J$, $K$, $L$, and $M$, and the treatment factor is
denoted by $T$. The model can be written as

$$
\begin{aligned}
C(Y|T = J) &= \beta_0 \\
C(Y|T = K) &= \beta_0 + \beta_1 \\
C(Y|T = L) &= \beta_0 + \beta_2 \\
C(Y|T = M) &= \beta_0 + \beta_3.
\end{aligned} \tag{2.11}
$$

The four treatments are thus completely specified by three regression parameters
and one intercept that we are using to denote treatment $J$, the reference
treatment. This model can be written in the previous notation as

$$
C(Y|T) = X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3, \tag{2.12}
$$

where

$$
\begin{aligned}
X_1 &= 1 \text{ if } T = K, \ 0 \text{ otherwise} \\
X_2 &= 1 \text{ if } T = L, \ 0 \text{ otherwise} \\
X_3 &= 1 \text{ if } T = M, \ 0 \text{ otherwise.}
\end{aligned} \tag{2.13}
$$

For treatment $J$ ($T = J$), all three $X$s are zero and $C(Y|T = J) = \beta_0$.

The test for any differences in the property $C(Y)$ between treatments is

$$
H_0 : \beta_1 = \beta_2 = \beta_3 = 0.
$$

This model is an analysis of variance or $k$-sample-type model. If there are
other descriptor covariables in the model, it becomes an analysis of
covariance-type model.
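The indicator coding of Eq. 2.13 can be illustrated with a small R sketch. The treatment vector below is made up for illustration; `model.matrix()` with R's default treatment contrasts produces an intercept column plus $k - 1 = 3$ indicator columns, with the first factor level ($J$) as the reference.

```r
# Hypothetical treatment assignments; J is listed first so it is the reference level
trt <- factor(c("J", "K", "L", "M", "K", "J"),
              levels = c("J", "K", "L", "M"))

# Expand the factor into an intercept plus k - 1 = 3 indicator variables
X <- model.matrix(~ trt)
print(X)

# Rows for the reference treatment J have all three indicators equal to zero
indicators_for_J <- X[trt == "J", -1, drop = FALSE]
print(indicators_for_J)
```

Choosing a different reference level only changes which differences the $\beta$s represent; the fitted values for the four treatments are unchanged.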


2.3.2  Interactions 

Suppose that a model has descriptor variables $X_1$ and $X_2$ and that the effect
of the two $X$s cannot be separated; that is, the effect of $X_1$ on $Y$ depends on
the level of $X_2$ and vice versa. One simple way to describe this interaction is
to add the constructed variable $X_3 = X_1 X_2$ to the model:

$$
C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2. \tag{2.14}
$$


It is now difficult to interpret $\beta_1$ and $\beta_2$ in isolation. However, we may quantify
the effect of a one-unit increase in $X_1$ if $X_2$ is held constant as

$$
\begin{aligned}
C(Y|X_1 + 1, X_2) - C(Y|X_1, X_2)
&= \beta_0 + \beta_1 (X_1 + 1) + \beta_2 X_2 + \beta_3 (X_1 + 1) X_2 \\
&\quad - [\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2] \\
&= \beta_1 + \beta_3 X_2.
\end{aligned} \tag{2.15}
$$

Likewise, the effect of a one-unit increase in $X_2$ on $C$ if $X_1$ is held constant is
$\beta_2 + \beta_3 X_1$. Interactions can be much more complex than can be modeled with
a product of two terms. If $X_1$ is binary, the interaction may take the form
of a difference in shape (and/or distribution) of $X_2$ versus $C(Y)$ depending
on whether $X_1 = 0$ or $X_1 = 1$ (e.g., logarithm vs. square root). When both
variables are continuous, the possibilities are much greater (this case is
discussed later). Interactions among more than two variables can be exceedingly
complex.

Table 2.1 Parameters in a simple model with interaction

Parameter   Meaning
$\beta_0$   $C(Y|age = 0, sex = m)$
$\beta_1$   $C(Y|age = x + 1, sex = m) - C(Y|age = x, sex = m)$
$\beta_2$   $C(Y|age = 0, sex = f) - C(Y|age = 0, sex = m)$
$\beta_3$   $C(Y|age = x + 1, sex = f) - C(Y|age = x, sex = f) -$
            $[C(Y|age = x + 1, sex = m) - C(Y|age = x, sex = m)]$
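The derivation in Eq. 2.15 can be checked numerically. The sketch below uses made-up coefficient values (they are assumptions, not estimates from any dataset) and compares the direct difference $C(Y|X_1 + 1, X_2) - C(Y|X_1, X_2)$ with the closed form $\beta_1 + \beta_3 X_2$ at several values of $X_2$.

```r
# Hypothetical coefficients for C(Y|X) = b0 + b1*X1 + b2*X2 + b3*X1*X2
b <- c(b0 = 1.0, b1 = 0.5, b2 = -0.3, b3 = 0.2)

# The model of Eq. 2.14 as a function of X1 and X2
C <- function(x1, x2) {
  b[["b0"]] + b[["b1"]] * x1 + b[["b2"]] * x2 + b[["b3"]] * x1 * x2
}

x1 <- 2                # any starting value of X1 gives the same difference
x2 <- c(0, 1, 4)       # levels at which X2 is held constant

effect_x1   <- C(x1 + 1, x2) - C(x1, x2)    # direct one-unit difference
closed_form <- b[["b1"]] + b[["b3"]] * x2   # Eq. 2.15

print(cbind(x2, effect_x1, closed_form))
```

The effect of $X_1$ is a different number at each level of $X_2$, which is why $\beta_1$ alone describes the $X_1$ effect only when $X_2 = 0$.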


2.3.3  Example:  Inference  for  a  Simple  Model 


Suppose  we  postulated  the  model 


$$
C(Y|\text{age}, \text{sex}) = \beta_0 + \beta_1 \text{age} + \beta_2 [\text{sex} = f] + \beta_3 \text{age} [\text{sex} = f],
$$


where  [sex  =  f]  is  a  0-1  indicator  variable  for  sex  =  female;  the  reference  cell 
is  sex  =  male  corresponding  to  a  zero  value  of  the  indicator  variable.  This  is 
a  model  that  assumes 


1.  age  is  linearly  related  to  C(Y)  for  males, 

2.  age  is  linearly  related  to  C(Y)  for  females,  and 

3. whatever distribution, variance, and independence assumptions are appropriate
for the model being considered.
We are thus assuming that the interaction between age and sex is simple;
that is, it only alters the slope of the age effect. The parameters in the model
have interpretations shown in Table 2.1. $\beta_3$ is the difference in slopes (female
minus male).

There  are  many  useful  hypotheses  that  can  be  tested  for  this  model.  First 
let’s  consider  two  hypotheses  that  are  seldom  appropriate  although  they  are 
routinely  tested. 

1. $H_0 : \beta_1 = 0$: This tests whether age is associated with $Y$ for males.

2. $H_0 : \beta_2 = 0$: This tests whether sex is associated with $Y$ for zero-year-olds.

Now  consider  more  useful  hypotheses.  For  each  hypothesis  we  should  write 
what  is  being  tested,  translate  this  to  tests  in  terms  of  parameters,  write  the 
alternative  hypothesis,  and  describe  what  the  test  has  maximum  power  to 
detect.  The  latter  component  of  a  hypothesis  test  needs  to  be  emphasized,  as 
almost  every  statistical  test  is  focused  on  one  specific  pattern  to  detect.  For 
example,  a  test  of  association  against  an  alternative  hypothesis  that  a  slope 
is  nonzero  will  have  maximum  power  when  the  true  association  is  linear. 
If  the  true  regression  model  is  exponential  in  X,  a  linear  regression  test 
will  have  some  power  to  detect  “non-flatness”  but  it  will  not  be  as  powerful 
as  the  test  from  a  well-specified  exponential  regression  effect.  If  the  true 
effect  is  U-shaped,  a  test  of  association  based  on  a  linear  model  will  have 
almost  no  power  to  detect  association.  If  one  tests  for  association  against 
a  quadratic  (parabolic)  alternative,  the  test  will  have  some  power  to  detect 
a  logarithmic  shape  but  it  will  have  very  little  power  to  detect  a  cyclical 
trend  having  multiple  “humps.”  In  a  quadratic  regression  model,  a  test  of 
linearity  against  a  quadratic  alternative  hypothesis  will  have  reasonable  power 
to  detect  a  quadratic  nonlinear  effect  but  very  limited  power  to  detect  a 
multiphase  cyclical  trend.  Therefore  in  the  tests  in  Table  2.2  keep  in  mind 
that  power  is  maximal  when  linearity  of  the  age  relationship  holds  for  both 
sexes.  In  fact  it  may  be  useful  to  write  alternative  hypotheses  as,  for  example, 
“$H_a$: age is associated with $C(Y)$, powered to detect a linear relationship.”

Note  that  if  there  is  an  interaction  effect,  we  know  that  there  is  both  an 
age  and  a  sex  effect.  However,  there  can  also  be  age  or  sex  effects  when  the 
lines  are  parallel.  That’s  why  the  tests  of  total  association  have  2  d.f. 
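The 2 d.f. tests of total association can be sketched in R by comparing nested least squares fits. The data below are simulated and the variable names are illustrative; the approach (an F-test between a reduced and a full model) is one ordinary way to test several parameters jointly, not necessarily the one used elsewhere in this text.

```r
set.seed(1)
n      <- 200
age    <- runif(n, 20, 80)
female <- rbinom(n, 1, 0.5)   # 0-1 indicator [sex = f]
y      <- 1 + 0.03 * age + 0.5 * female + 0.02 * age * female + rnorm(n)

full     <- lm(y ~ age * female)   # b0 + b1 age + b2 female + b3 age:female
no_age   <- lm(y ~ female)         # imposes H0: b1 = b3 = 0
no_sex   <- lm(y ~ age)            # imposes H0: b2 = b3 = 0
additive <- lm(y ~ age + female)   # imposes H0: b3 = 0

test_age <- anova(no_age, full)    # total age association, 2 d.f.
test_sex <- anova(no_sex, full)    # total sex association, 2 d.f.
test_int <- anova(additive, full)  # interaction (parallelism), 1 d.f.

print(test_age)
print(test_sex)
print(test_int)
```

Each total-association test drops the main effect together with the interaction term, which is where the second degree of freedom comes from.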


2.4  Relaxing  Linearity  Assumption  for  Continuous 
Predictors 

2.4.1 Avoiding Categorization

Relationships  among  variables  are  seldom  linear,  except  in  special  cases 
such  as  when  one  variable  is  compared  with  itself  measured  at  a  different 
time. It is a common belief among practitioners who do not study bias and
efficiency in depth that the presence of non-linearity should be dealt with by
chopping continuous variables into intervals. Nothing could be more disastrous
[13, 14, 17, 45, 82, 185, 187, 215, 294, 300, 379, 446, 465, 521, 533, 559, 597, 646].


Table 2.2 Most Useful Tests for Linear Age × Sex Model

Null or Alternative Hypothesis                       Mathematical Statement

Effect of age is independent of sex or               $H_0 : \beta_3 = 0$
  Effect of sex is independent of age or
  Age and sex are additive
  Age effects are parallel
Age interacts with sex                               $H_a : \beta_3 \neq 0$
  Age modifies effect of sex
  Sex modifies effect of age
  Sex and age are non-additive (synergistic)

Age is not associated with Y                         $H_0 : \beta_1 = \beta_3 = 0$
Age is associated with Y                             $H_a : \beta_1 \neq 0$ or $\beta_3 \neq 0$
  Age is associated with Y for either
  females or males

Sex is not associated with Y                         $H_0 : \beta_2 = \beta_3 = 0$
Sex is associated with Y                             $H_a : \beta_2 \neq 0$ or $\beta_3 \neq 0$
  Sex is associated with Y for some
  value of age

Neither age nor sex is associated with Y             $H_0 : \beta_1 = \beta_2 = \beta_3 = 0$
Either age or sex is associated with Y               $H_a : \beta_1 \neq 0$ or $\beta_2 \neq 0$ or $\beta_3 \neq 0$

Problems  caused  by  dichotomization  include  the  following. 

1. Estimated values will have reduced precision, and associated tests will have
reduced power.

2. Categorization assumes that the relationship between the predictor and the
response is flat within intervals; this assumption is far less reasonable than a
linearity assumption in most cases.

3.  To  make  a  continuous  predictor  be  more  accurately  modeled  when  categorization 
is  used,  multiple  intervals  are  required.  The  needed  indicator  variables  will  spend 
more  degrees  of  freedom  than  will  fitting  a  smooth  relationship,  hence  power  and 
precision  will  suffer.  And  because  of  sample  size  limitations  in  the  very  low  and 
very  high  range  of  the  variable,  the  outer  intervals  (e.g.,  outer  quintiles)  will  be 
wide,  resulting  in  significant  heterogeneity  of  subjects  within  those  intervals,  and 
residual  confounding. 

4. Categorization assumes that there is a discontinuity in response as interval
boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop
after bad news), there are very few examples in which such discontinuities have
been shown to exist.

5. Categorization only seems to yield interpretable estimates such as odds ratios.
For example, suppose one computes the odds ratio for stroke for persons with
a systolic blood pressure > 160 mmHg compared with persons with a blood
pressure ≤ 160 mmHg. The interpretation of the resulting odds ratio will depend
on the exact distribution of blood pressures in the sample (the proportion of
subjects > 170, > 180, etc.). On the other hand, if blood pressure is modeled as
a continuous variable (e.g., using a regression spline, quadratic, or linear effect)
one can estimate the ratio of odds for exact settings of the predictor, e.g., the
odds ratio for 200 mmHg compared with 120 mmHg.

6.  Categorization  does  not  condition  on  full  information.  When,  for  example,  the 
risk  of  stroke  is  being  assessed  for  a  new  subject  with  a  known  blood  pressure 
(say  162  mmHg),  the  subject  does  not  report  to  her  physician  “my  blood  pressure 
exceeds  160”  but  rather  reports  162  mmHg.  The  risk  for  this  subject  will  be  much 
lower  than  that  of  a  subject  with  a  blood  pressure  of  200  mmHg. 

7. If cutpoints are determined in a way that is not blinded to the response
variable, calculation of P-values and confidence intervals requires special simulation
techniques; ordinary inferential methods are completely invalid. For example, if
cutpoints are chosen by trial and error in a way that utilizes the response, even
informally, ordinary P-values will be too small and confidence intervals will not
have the claimed coverage probabilities. The correct Monte-Carlo simulations
must take into account both multiplicities and uncertainty in the choice of
cutpoints. For example, if a cutpoint is chosen that minimizes the P-value and the
resulting P-value is 0.05, the true type I error can easily be above 0.5 [300].

8. Likewise, categorization that is not blinded to the response variable results in
biased effect estimates [17, 559].

9. “Optimal” cutpoints do not replicate over studies. Hollander et al. [300] state that
“. . . the optimal cutpoint approach has disadvantages. One of these is that in
almost every study where this method is applied, another cutpoint will emerge.
This makes comparisons across studies extremely difficult or even impossible.
Altman et al. point out this problem for studies of the prognostic relevance of the
S-phase fraction in breast cancer published in the literature. They identified 19
different cutpoints used in the literature; some of them were solely used because
they emerged as the ‘optimal’ cutpoint in a specific data set. In a meta-analysis on
the relationship between cathepsin-D content and disease-free survival in
node-negative breast cancer patients, 12 studies were included with 12 different
cutpoints . . . Interestingly, neither cathepsin-D nor the S-phase fraction are
recommended to be used as prognostic markers in breast cancer in the recent update
of the American Society of Clinical Oncology.” Giannoni et al. [215] demonstrated
that many claimed “optimal cutpoints” are just the observed median values in the
sample, which happen to optimize statistical power for detecting a separation in
outcomes and have nothing to do with true outcome thresholds. Disagreements
in cutpoints (which are bound to happen whenever one searches for things that
do not exist) cause severe interpretation problems. One study may provide an
odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another
for comparing BMI > 28 with BMI ≤ 28. Neither of these odds ratios has a good
definition and the two estimates are not comparable.

10. Cutpoints are arbitrary and manipulable; cutpoints can be found that can result
in both positive and negative associations [646].

11. If a confounder is adjusted for by categorization, there will be residual
confounding that can be explained away by inclusion of the continuous form of the predictor
in the model in addition to the categories.

When  cutpoints  are  chosen  using  Y ,  categorization  represents  one  of  those 
few  times  in  statistics  where  both  type  I  and  type  II  errors  are  elevated. 
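The inflation of type I error from response-guided cutpoints can be seen in a small simulation. This is a minimal sketch under assumed settings (normal data, $n = 100$, candidate cutpoints at the 10th through 90th percentiles, a two-sample t-test at each cutpoint); none of these choices come from the text. The response is generated independently of the predictor, so every rejection is a false positive.

```r
set.seed(2)
n_sim <- 500                        # number of simulated datasets
n     <- 100                        # sample size per dataset
probs <- seq(0.1, 0.9, by = 0.05)   # candidate cutpoints, as quantiles of x

min_p <- replicate(n_sim, {
  x <- rnorm(n)
  y <- rnorm(n)                     # y is independent of x: H0 is true
  cuts <- quantile(x, probs)
  # P-value for a high vs. low comparison at each candidate cutpoint
  p <- sapply(cuts, function(cc) t.test(y[x > cc], y[x <= cc])$p.value)
  min(p)                            # keep the "optimal" cutpoint's P-value
})

# Proportion of null datasets declared significant at the nominal 0.05 level
type1 <- mean(min_p < 0.05)
print(type1)
```

With a single prespecified cutpoint the rejection rate would be near 0.05; searching over cutpoints and reporting the smallest P-value should push it well above that.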

A scientific quantity is a quantity which can be defined outside of the
specifics of the current experiment. The kind of high:low estimates that
result from categorizing a continuous variable are not scientific quantities; their
interpretation depends on the entire sample distribution of continuous
measurements within the chosen intervals.
To summarize problems with categorization it is useful to examine its
effective assumptions. Suppose one assumes there is a single cutpoint $c$ for
predictor $X$. Assumptions implicit in seeking or using this cutpoint include
(1) the relationship between $X$ and the response $Y$ is discontinuous at $X = c$
and only $X = c$; (2) $c$ is correctly found as the cutpoint; (3) $X$ vs. $Y$ is
flat to the left of $c$; (4) $X$ vs. $Y$ is flat to the right of $c$; (5) the “optimal”
cutpoint does not depend on the values of other predictors. Failure to have
these assumptions satisfied will result in great error in estimating $c$ (because
it doesn’t exist), low predictive accuracy, serious lack of model fit, residual
confounding, and overestimation of effects of remaining variables.

A  better  approach  that  maximizes  power  and  that  only  assumes  a  smooth 
relationship  is  to  use  regression  splines  for  predictors  that  are  not  known 
to  predict  linearly.  Use  of  flexible  parametric  approaches  such  as  this  allows 
standard  inference  techniques  (P-values,  confidence  limits)  to  be  used,  as 
will  be  described  below.  Before  introducing  splines,  we  consider  the  simplest 
approach  to  allowing  for  nonlinearity. 


2.4.2 Simple Nonlinear Terms

If a continuous predictor is represented, say, as $X_1$ in the model, the model
is assumed to be linear in $X_1$. Often, however, the property of $Y$ of interest
does not behave linearly in all the predictors. The simplest way to describe
a nonlinear effect of $X_1$ is to include a term for $X_2 = X_1^2$ in the model:

$$
C(Y|X_1) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2. \tag{2.16}
$$

If the model is truly linear in $X_1$, $\beta_2$ will be zero. This model formulation
allows one to test $H_0$ : model is linear in $X_1$ against $H_a$ : model is quadratic
(parabolic) in $X_1$ by testing $H_0 : \beta_2 = 0$.
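As a minimal sketch of this test, the following R code fits Eq. 2.16 by least squares to simulated data (the data-generating values are assumptions chosen for illustration) and extracts the P-value for $H_0 : \beta_2 = 0$.

```r
set.seed(3)
x1 <- runif(150, 0, 10)
y  <- 2 + 0.5 * x1 + 0.15 * x1^2 + rnorm(150)   # truly quadratic in x1

# Quadratic model: I() protects the square inside the formula
fit <- lm(y ~ x1 + I(x1^2))

# Wald test of linearity: H0 is that the coefficient of x1^2 is zero
p_linearity <- summary(fit)$coefficients["I(x1^2)", "Pr(>|t|)"]
print(p_linearity)
```

Keep in mind that this test is powered to detect parabolic departures from linearity; other shapes of nonlinearity may be missed.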

Nonlinear effects will frequently not be of a parabolic nature. If a
transformation of the predictor is known to induce linearity, that transformation
(e.g., $\log(X)$) may be substituted for the predictor. However, often the
transformation is not known. Higher powers of $X_1$ may be included in the model
to approximate many types of relationships, but polynomials have some
undesirable properties (e.g., undesirable peaks and valleys, and the fit in one
region of $X$ can be greatly affected by data in other regions [433]) and will not
adequately fit many functional forms [156]. For example, polynomials do not
adequately fit logarithmic functions or “threshold” effects.
2.4.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

A  draftsman’s  spline  is  a  flexible  strip  of  metal  or  rubber  used  to  draw  curves. 
Spline  functions  are  piecewise  polynomials  used  in  curve  fitting.  That  is,  they 
are  polynomials  within  intervals  of  X  that  are  connected  across  different 
intervals  of  X.  Splines  have  been  used,  principally  in  the  physical  sciences, 
to  approximate  a  wide  variety  of  functions.  The  simplest  spline  function  is  a 
linear spline function, a piecewise linear function. Suppose that the $x$ axis is
divided into intervals with endpoints at $a$, $b$, and $c$, called knots. The linear
spline function is given by

$$
f(X) = \beta_0 + \beta_1 X + \beta_2 (X - a)_+ + \beta_3 (X - b)_+ + \beta_4 (X - c)_+, \tag{2.17}
$$

where

$$
(u)_+ = \begin{cases} u, & u > 0 \\ 0, & u \le 0. \end{cases} \tag{2.18}
$$

The number of knots can vary depending on the amount of available data for
fitting the function. The linear spline function can be rewritten as

$$
f(X) = \begin{cases}
\beta_0 + \beta_1 X, & X \le a \\
\beta_0 + \beta_1 X + \beta_2 (X - a), & a < X \le b \\
\beta_0 + \beta_1 X + \beta_2 (X - a) + \beta_3 (X - b), & b < X \le c \\
\beta_0 + \beta_1 X + \beta_2 (X - a) + \beta_3 (X - b) + \beta_4 (X - c), & c < X.
\end{cases} \tag{2.19}
$$

A linear spline is depicted in Figure 2.1.

The general linear regression model can be written assuming only piecewise
linearity in $X$ by incorporating constructed variables $X_2$, $X_3$, and $X_4$:

$$
C(Y|X) = f(X) = X\beta, \tag{2.20}
$$

where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, and

$$
\begin{aligned}
X_1 &= X & X_2 &= (X - a)_+ \\
X_3 &= (X - b)_+ & X_4 &= (X - c)_+.
\end{aligned} \tag{2.21}
$$

By modeling a slope increment for $X$ in an interval $(a, b]$ in terms of $(X - a)_+$,
the function is constrained to join (“meet”) at the knots. Overall linearity in
$X$ can be tested by testing $H_0 : \beta_2 = \beta_3 = \beta_4 = 0$.
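The constructed variables of Eq. 2.21 are easy to build by hand. The sketch below uses simulated data with knots at 1, 3, and 5 (the same knots as Figure 2.1; the data-generating model is an assumption for illustration), fits the linear spline by least squares, and carries out the 3 d.f. test of overall linearity by comparing against a straight-line fit.

```r
set.seed(4)
pos   <- function(u) pmax(u, 0)   # the (u)_+ function of Eq. 2.18
knots <- c(a = 1, b = 3, c = 5)

x <- runif(300, 0, 6)
# Simulated truth: the slope increases by 1.5 after the knot at b
y <- 1 + 0.5 * x + 1.5 * pos(x - knots[["b"]]) + rnorm(300, sd = 0.5)

# Constructed variables X2, X3, X4 of Eq. 2.21
x2 <- pos(x - knots[["a"]])
x3 <- pos(x - knots[["b"]])
x4 <- pos(x - knots[["c"]])

fit_spline <- lm(y ~ x + x2 + x3 + x4)
fit_linear <- lm(y ~ x)

# H0: beta2 = beta3 = beta4 = 0 (overall linearity), 3 d.f.
linearity_test <- anova(fit_linear, fit_spline)
print(linearity_test)
```

Each coefficient on a truncated term is a change in slope at that knot, so the fitted function is continuous by construction.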
Fig. 2.1 A linear spline function with knots at a = 1, b = 3, c = 5.


2.4.4 Cubic Spline Functions


Although  the  linear  spline  is  simple  and  can  approximate  many  common 
relationships,  it  is  not  smooth  and  will  not  fit  highly  curved  functions  well. 
These  problems  can  be  overcome  by  using  piecewise  polynomials  of  order 
higher  than  linear.  Cubic  polynomials  have  been  found  to  have  nice  properties 
with  good  ability  to  fit  sharply  curving  shapes.  Cubic  splines  can  be  made  to 
be  smooth  at  the  join  points  (knots)  by  forcing  the  first  and  second  derivatives 
of the function to agree at the knots. Such a smooth cubic spline function
with three knots ($a$, $b$, $c$) is given by

$$
\begin{aligned}
f(X) &= \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \\
&\quad + \beta_4 (X - a)_+^3 + \beta_5 (X - b)_+^3 + \beta_6 (X - c)_+^3 \\
&= X\beta
\end{aligned} \tag{2.22}
$$

with the following constructed variables:

$$
\begin{aligned}
X_1 &= X & X_2 &= X^2 \\
X_3 &= X^3 & X_4 &= (X - a)_+^3 \\
X_5 &= (X - b)_+^3 & X_6 &= (X - c)_+^3.
\end{aligned} \tag{2.23}
$$

If the cubic spline function has $k$ knots, the function will require estimating
$k + 3$ regression coefficients besides the intercept. See Section 2.4.6 for
information on choosing the number and location of knots.
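A minimal R sketch of the truncated power basis of Eq. 2.23 follows. The helper names `pos3` and `cubic_basis`, the knots, and the simulated response are all illustrative assumptions. With $k = 3$ knots the basis has $k + 3 = 6$ columns, so a fit has 7 coefficients including the intercept.

```r
pos3 <- function(u) pmax(u, 0)^3   # truncated cubic (u)_+^3

# Truncated power basis for an unrestricted cubic spline: X, X^2, X^3,
# followed by one truncated cubic term per knot
cubic_basis <- function(x, knots) {
  truncated <- sapply(knots, function(t) pos3(x - t))
  cbind(x, x^2, x^3, truncated)
}

x <- seq(0, 6, length.out = 200)
B <- cubic_basis(x, knots = c(1, 3, 5))

set.seed(7)
y   <- sin(x) + rnorm(length(x), sd = 0.2)   # hypothetical response
fit <- lm(y ~ B)

print(dim(B))             # 200 rows, k + 3 = 6 columns
print(length(coef(fit)))  # 7: intercept plus k + 3 coefficients
```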

There are more numerically stable ways to form a design matrix for cubic
spline functions that are based on B-splines instead of the truncated power
basis [152, 575] used here. However, B-splines are more complex and do not allow
for extrapolation beyond the outer knots, and the truncated power basis
seldom presents estimation problems (see Section 4.6) when modern methods
such as the Q-R decomposition are used for matrix inversion.
2.4.5 Restricted Cubic Splines

Stone and Koo [595] have found that cubic spline functions do have a drawback
in that they can be poorly behaved in the tails, that is, before the first knot and
after the last knot. They cite advantages of constraining the function to be
linear in the tails. Their restricted cubic spline function (also called natural
splines) has the additional advantage that only $k - 1$ parameters must be
estimated (besides the intercept) as opposed to $k + 3$ parameters with the
unrestricted cubic spline. The restricted spline function with $k$ knots $t_1, \ldots, t_k$
is given by [156]

$$
f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_{k-1} X_{k-1}, \tag{2.24}
$$

where $X_1 = X$ and for $j = 1, \ldots, k - 2$,

$$
\begin{aligned}
X_{j+1} &= (X - t_j)_+^3 - (X - t_{k-1})_+^3 (t_k - t_j) / (t_k - t_{k-1}) \\
&\quad + (X - t_k)_+^3 (t_{k-1} - t_j) / (t_k - t_{k-1}).
\end{aligned} \tag{2.25}
$$

It can be shown that $X_j$ is linear in $X$ for $X \ge t_k$. For numerical behavior and
to put all basis functions for $X$ on the same scale, R Hmisc and rms package
functions by default divide the terms in Eq. 2.25 by

$$
\tau = (t_k - t_1)^2. \tag{2.26}
$$
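Equations 2.25 and 2.26 can be written out directly in base R. The function `rcs_basis` below is a hypothetical helper written from the formulas above, not the Hmisc or rms implementation. It returns the $k - 1$ columns $X_1, \ldots, X_{k-1}$, with the nonlinear columns divided by $\tau$, and the sketch then checks numerically that every column is linear beyond the last knot.

```r
pos3 <- function(u) pmax(u, 0)^3   # truncated cubic (u)_+^3

# Restricted cubic spline basis from Eq. 2.25, scaled by tau of Eq. 2.26
rcs_basis <- function(x, knots) {
  k   <- length(knots)
  tk  <- knots[k]
  tk1 <- knots[k - 1]
  tau <- (tk - knots[1])^2
  nonlinear <- sapply(seq_len(k - 2), function(j) {
    tj <- knots[j]
    (pos3(x - tj) -
       pos3(x - tk1) * (tk - tj) / (tk - tk1) +
       pos3(x - tk)  * (tk1 - tj) / (tk - tk1)) / tau
  })
  cbind(x, nonlinear)   # X_1 = X, then X_2, ..., X_{k-1}
}

knots <- c(0.05, 0.275, 0.5, 0.725, 0.95)   # k = 5
x     <- seq(0, 1.5, by = 0.01)
X     <- rcs_basis(x, knots)

# On an equally spaced grid a linear function has zero second differences,
# so this should be zero (up to rounding) beyond the last knot
tail_rows     <- X[x >= 0.95, , drop = FALSE]
max_curvature <- max(abs(diff(tail_rows, differences = 2)))
print(ncol(X))
print(max_curvature)
```

The linearity beyond $t_k$ arises because the cubic and quadratic terms of the three truncated cubics in Eq. 2.25 cancel exactly once all three are active.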


Figure 2.2 displays the $\tau$-scaled spline component variables $X_j$ for $j =
2, 3, 4$ and $k = 5$ and one set of knots. The left graph magnifies the lower
portion of the curves.

    require(Hmisc)


Figure 2.3 displays some typical shapes of restricted cubic spline functions
with $k = 3, 4, 5$, and $6$. These functions were generated using random $\beta$.
Fig. 2.2 Restricted cubic spline component variables for k = 5 and knots at X =
.05, .275, .5, .725, and .95. Nonlinear basis functions are scaled by $\tau$. The left panel
is a y-magnification of the right panel. Fitted functions such as those in Figure 2.3
will be linear combinations of these basis functions as long as knots are at the same
locations used here.


    require(Hmisc)   # provides rcspline.eval
    x <- seq(0, 1, length=300)
    for(nk in 3:6) {
      set.seed(nk)
      knots <- seq(.05, .95, length=nk)
      # Restricted cubic spline basis, including the linear term
      xx <- rcspline.eval(x, knots=knots, inclx=TRUE)
      # Rescale each basis column to [0, 1]
      for(i in 1 : (nk - 1))
        xx[,i] <- (xx[,i] - min(xx[,i])) /
                  (max(xx[,i]) - min(xx[,i]))
      for(i in 1 : 20) {
        # Random coefficients and intercept, uniform on (-1, 1)
        beta  <- 2*runif(nk-1) - 1
        xbeta <- xx %*% beta + 2 * runif(1) - 1
        # Rescale the fitted function to [0, 1]
        xbeta <- (xbeta - min(xbeta)) /
                 (max(xbeta) - min(xbeta))
        if(i == 1) {
          plot(x, xbeta, type="l", lty=1,
               xlab=expression(X), ylab='', bty="l")
          title(sub=paste(nk,"knots"), adj=0, cex=.75)
          # Mark the knot locations with arrows
          for(j in 1 : nk)
            arrows(knots[j], .04, knots[j], -.03,
                   angle=20, length=.07, lwd=1.5)
        }
        else lines(x, xbeta, col=i)
      }
    }
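Once $\beta_0, \ldots, \beta_{k-1}$ are estimated, the restricted cubic spline can be restated
in the form

$$
\begin{aligned}
f(X) &= \beta_0 + \beta_1 X + \beta_2 (X - t_1)_+^3 + \beta_3 (X - t_2)_+^3 \\
&\quad + \ldots + \beta_{k+1} (X - t_k)_+^3
\end{aligned} \tag{2.27}
$$

by dividing $\beta_2, \ldots, \beta_{k-1}$ by $\tau$ (Eq. 2.26) and computing

$$
\begin{aligned}
\beta_k &= [\beta_2 (t_1 - t_k) + \beta_3 (t_2 - t_k) + \ldots + \beta_{k-1} (t_{k-2} - t_k)] / (t_k - t_{k-1}) \\
\beta_{k+1} &= [\beta_2 (t_1 - t_{k-1}) + \beta_3 (t_2 - t_{k-1}) + \ldots + \beta_{k-1} (t_{k-2} - t_{k-1})] / (t_{k-1} - t_k).
\end{aligned} \tag{2.28}
$$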

A test of linearity in $X$ can be obtained by testing

$$
H_0 : \beta_2 = \beta_3 = \ldots = \beta_{k-1} = 0. \tag{2.29}
$$

The truncated power basis for restricted cubic splines does allow for
rational (i.e., linear) extrapolation beyond the outer knots. However, when
the outer knots are in the tails of the data, extrapolation can still be dangerous.

When  nonlinear  terms  in  Equation  2.25  are  normalized,  for  example,  by 
dividing  them  by  the  square  of  the  difference  in  the  outer  knots  to  make  all 
terms  have  units  of  X,  the  ordinary  truncated  power  basis  has  no  numerical 
difficulties  when  modern  matrix  algebra  software  is  used. 


2.4.6 Choosing Number and Position of Knots

We  have  assumed  that  the  locations  of  the  knots  are  specified  in  advance; 
that  is,  the  knot  locations  are  not  treated  as  free  parameters  to  be  estimated. 
If  knots  were  free  parameters,  the  fitted  function  would  have  more  flexibility 
but  at  the  cost  of  instability  of  estimates,  statistical  inference  problems,  and 
inability to use standard regression modeling software for estimating regression
parameters.

How then does the analyst pre-assign knot locations? If the regression
relationship were described by prior experience, pre-specification of knot
locations would be easy. For example, if a function were known to change
curvature at $X = a$, a knot could be placed at $a$. However, in most situations
there is no way to pre-specify knots. Fortunately, Stone [59] has found that
the location of knots in a restricted cubic spline model is not very crucial in
most situations; the fit depends much more on the choice of $k$, the number of
knots. Placing knots at fixed quantiles (percentiles) of a predictor’s marginal
distribution is a good approach in most datasets. This ensures that enough
points are available in each interval, and also guards against letting outliers
overly influence knot placement. Recommended equally spaced quantiles are
shown in Table 2.3.
Fig. 2.3 Some typical restricted cubic spline functions for k = 3, 4, 5, 6 (one panel
for each number of knots). The y-axis is $X\beta$. Arrows indicate knots. These curves
were derived by randomly choosing values of $\beta$ subject to standard deviations of
fitted functions being normalized.

Table 2.3 Default quantiles for knots

k   Quantiles
3   .10    .5     .90
4   .05    .35    .65    .95
5   .05    .275   .5     .725   .95
6   .05    .23    .41    .59    .77    .95
7   .025   .1833  .3417  .5     .6583  .8167  .975
The principal reason for using less extreme default quantiles for $k = 3$ and
more extreme ones for $k = 7$ is that one usually uses $k = 3$ for small sample
sizes and $k = 7$ for large samples. When the sample size is less than 100, the
outer quantiles should be replaced by the fifth smallest and fifth largest data
points, respectively [595]. What about the choice of $k$? The flexibility of possible
fits must be tempered by the sample size available to estimate the unknown
parameters. Stone [59] has found that more than 5 knots are seldom required
in a restricted cubic spline model. The principal decision then is between
$k = 3$, $4$, or $5$. For many datasets, $k = 4$ offers an adequate fit of the model
and is a good compromise between flexibility and loss of precision caused
by overfitting a small sample. When the sample size is large (e.g., $n > 100$
with a continuous uncensored response variable), $k = 5$ is a good choice.
Small samples (< 30, say) may require the use of $k = 3$. Akaike’s information
criterion (AIC, Section 9.8.1) can be used for a data-based choice of $k$. The
value of $k$ maximizing the model likelihood ratio $\chi^2 - 2k$ would be the best
“for the money” using AIC.
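A data-based choice of $k$ along these lines might be sketched as follows. Two assumptions are made here: the data are simulated, and the natural spline function `ns()` from the `splines` package that ships with R stands in for a restricted cubic spline (interior knots are passed as `knots` and the two outer knots as `Boundary.knots`, so the fit is linear beyond the outer knots). Knots are placed at the default quantiles of Table 2.3.

```r
library(splines)

set.seed(5)
n <- 300
x <- runif(n, 0, 10)
y <- log1p(x) + rnorm(n, sd = 0.3)   # hypothetical smooth relationship

# Default knot quantiles from Table 2.3 for k = 3, 4, 5
knot_probs <- list(`3` = c(.10, .5, .90),
                   `4` = c(.05, .35, .65, .95),
                   `5` = c(.05, .275, .5, .725, .95))

null_fit <- lm(y ~ 1)

# Model likelihood ratio chi-square minus 2k for each candidate k
score <- sapply(knot_probs, function(p) {
  kn  <- quantile(x, p)
  k   <- length(kn)
  fit <- lm(y ~ ns(x, knots = kn[-c(1, k)], Boundary.knots = kn[c(1, k)]))
  lr_chisq <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(null_fit)))
  lr_chisq - 2 * k
})

best_k <- as.integer(names(score)[which.max(score)])
print(score)
print(best_k)
```

Because the penalty grows with $k$, extra knots are chosen only when they improve the likelihood ratio $\chi^2$ by more than 2 per added parameter.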

The  analyst  may  wish  to  devote  more  knots  to  variables  that  are  thought 
to  be  more  important,  and  risk  lack  of  fit  for  less  important  variables.  In  this 
way  the  total  number  of  estimated  parameters  can  be  controlled  (Section  4.1). 


2.4.7 Nonparametric Regression

One  of  the  most  important  results  of  an  analysis  is  the  estimation  of  the 
tendency  (trend)  of  how  X  relates  to  Y.  This  trend  is  useful  in  its  own  right 
and  it  may  be  sufficient  for  obtaining  predicted  values  in  some  situations,  but 
trend estimates can also be used to guide formal regression modeling (by
suggesting predictor variable transformations) and to check model assumptions.

Nonparametric smoothers are excellent tools for determining the shape
of the relationship between a predictor and the response. The standard
nonparametric smoothers work when one is interested in assessing one continuous
predictor at a time and when the property of the response that should be
linearly related to the predictor is a standard measure of central tendency. For
example, when $C(Y)$ is $E(Y)$ or $\Pr[Y = 1]$, standard smoothers are useful,
but when $C(Y)$ is a measure of variability or a rate (instantaneous risk), or
when $Y$ is only incompletely measured for some subjects (e.g., $Y$ is censored
for some subjects), simple smoothers will not work.

The oldest and simplest nonparametric smoother is the moving average.
Suppose that the data consist of the points $X = 1, 2, 3, 5$, and $8$, with the
corresponding $Y$ values 2.1, 3.8, 5.7, 11.1, and 17.2. To smooth the relationship
we could estimate $E(Y|X = 2)$ by $(2.1 + 3.8 + 5.7)/3$ and $E(Y|X = (2 + 3 +
5)/3)$ by $(3.8 + 5.7 + 11.1)/3$. Note that overlap is fine; that is, one point may
be contained in two sets that are averaged. You can immediately see that the
simple moving average has a problem in estimating $E(Y)$ at the outer values
of $X$. The estimates are quite sensitive to the choice of the number of points
(or interval width) to use in “binning” the data.
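The moving average calculation above can be reproduced in a few lines of R. The window of three consecutive points is the one used in the text; the variable names are illustrative.

```r
x <- c(1, 2, 3, 5, 8)
y <- c(2.1, 3.8, 5.7, 11.1, 17.2)

# Three-point moving averages over consecutive, overlapping windows
window <- 3
starts <- seq_len(length(x) - window + 1)
x_center <- sapply(starts, function(i) mean(x[i:(i + window - 1)]))
y_smooth <- sapply(starts, function(i) mean(y[i:(i + window - 1)]))

print(cbind(x_center, y_smooth))
```

Only three smoothed values are produced from five points, and none of them is located at the smallest or largest $X$, which is the difficulty at the outer values noted above.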

A moving least squares linear regression smoother is far superior to a
moving flat line smoother (moving average). Cleveland’s [111] moving linear
regression smoother loess has become the most popular smoother. To obtain
the smoothed value of Y at X = x, we take all the data having X values
within a suitable interval about x. Then a linear regression is fitted to all
of these points, and the predicted value from this regression at X = x is
taken as the estimate of E(Y|X = x). Actually, loess uses weighted least
squares estimates, which is why it is called a locally weighted least squares
method. The weights are chosen so that points near X = x are given the
most weight^b in the calculation of the slope and intercept. Surprisingly, a
good default choice for the interval about x is an interval containing 2/3 of
the data points! The weighting function is devised so that points near the
extremes of this interval receive almost no weight in the calculation of the
slope and intercept.
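
A minimal sketch of one locally weighted linear fit is given below. It is
not the loess implementation itself: the tricube weight function, the way
the window is chosen from the span, and the names tricube and
local_linear_fit are assumptions made for illustration, and no outlier
iterations are performed.

```python
import math


def tricube(u):
    """Tricube weight: 1 at the target point, 0 at or beyond the window edge."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(xs, ys, x0, span=2 / 3):
    """Weighted least squares line through the nearest `span` fraction of
    the points, evaluated at x0."""
    n = len(xs)
    k = max(2, math.ceil(span * n))              # number of points in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # half-width of the window
    w = [tricube((x - x0) / h) if h > 0 else float(x == x0) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:                               # all weight on a single X value
        return my
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return my + slope * (x0 - mx)


xs = [1, 2, 3, 5, 8]
ys = [2.1, 3.8, 5.7, 11.1, 17.2]
smoothed = [local_linear_fit(xs, ys, x) for x in xs]
```

Because a line rather than a constant is fitted in each window, data that
are exactly linear are reproduced exactly, including at the smallest and
largest X.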

Because  loess  uses  a  moving  straight  line  rather  than  a  moving  flat  one, 
it  provides  much  better  behavior  at  the  extremes  of  the  Xs.  For  example, 
one  can  fit  a  straight  line  to  the  first  three  data  points  and  then  obtain  the 
predicted  value  at  the  lowest  X,  which  takes  into  account  that  this  X  is  not 
the  middle  of  the  three  Xs. 

loess  obtains  smoothed  values  for  E(Y)  at  each  observed  value  of  X. 
Estimates  for  other  Xs  are  obtained  by  linear  interpolation. 

The loess algorithm has another component. After making an initial
estimate of the trend line, loess can look for outliers off this trend. It can
then delete or down-weight those apparent outliers to obtain a more robust
trend estimate. Now, different points will appear to be outliers with respect
to this second trend estimate. The new set of outliers is taken into account
and another trend line is derived. By default, the process stops after these
three iterations. loess works exceptionally well for binary Y as long as the
iterations that look for outliers are not done, that is, only one iteration is
performed.

For a single X, Friedman’s “super smoother” [207] is another efficient and
flexible nonparametric trend estimator. For both loess and the super smoother
the  amount  of  smoothing  can  be  controlled  by  the  analyst.  Hastie  and 
Tibshirani  5  provided  an  excellent  description  of  smoothing  methods  and 
developed  a  generalized  additive  model  for  multiple  Xs,  in  which  each 
continuous predictor is fitted with a nonparametric smoother (see
Chapter 16). Interactions are not allowed. Cleveland et al.9( have extended
two-dimensional smoothers to multiple dimensions without assuming additivity.
Their local regression model is feasible for up to four or so predictors. Local
regression models are extremely flexible, allowing parts of the model to be
parametrically specified, and allowing the analyst to choose the amount of
smoothing or the effective number of degrees of freedom of the fit.

b This weight is not to be confused with the regression coefficient; rather the
weights are w_1, w_2, \ldots, w_n and the fitting criterion is
\sum_{i=1}^{n} w_i (Y_i - \hat{Y}_i)^2.

Smoothing  splines  are  related  to  nonparametric  smoothers.  Here  a  knot 
is  placed  at  every  data  point,  but  a  penalized  likelihood  is  maximized  to 
derive the smoothed estimates. Gray [237, 238] developed a general method that
is halfway between smoothing splines and regression splines. He pre-specifies,
say, 10 fixed knots, but uses a penalized likelihood for estimation. This allows
the  analyst  to  control  the  effective  number  of  degrees  of  freedom  used. 

Besides  using  smoothers  to  estimate  regression  relationships,  smoothers  are 
valuable  for  examining  trends  in  residual  plots.  See  Sections  14.6  and  21.2 
for  examples. 


2.4.8 Advantages of Regression Splines over Other Methods

There are several advantages of regression splines [271]:

1. Parametric splines are piecewise polynomials and can be fitted using any
existing regression program after the constructed predictors are computed.
Spline regression is equally suitable to multiple linear regression, survival
models, and logistic models for discrete outcomes.

2. Regression coefficients for the spline function are estimated using
standard techniques (maximum likelihood or least squares), and statistical
inferences can readily be drawn. Formal tests of no overall association,
linearity, and additivity can readily be constructed. Confidence limits for
the estimated regression function are derived by standard theory.

3. The fitted spline function directly estimates the transformation that a
predictor should receive to yield linearity in C(Y|X). The fitted spline
transformation sometimes suggests a simple transformation (e.g., square
root) of a predictor that can be used if one is not concerned about the
proper number of degrees of freedom for testing association of the predictor
with the response.

4. The spline function can be used to represent the predictor in the final
model. Nonparametric methods do not yield a prediction equation.

5. Splines can be extended to non-additive models (see below).
Multidimensional nonparametric estimators often require burdensome computations.


2.5  Recursive  Partitioning:  Tree-Based  Models 

Breiman et al. have developed an essentially model-free approach called
classification and regression trees (CART), a form of recursive partitioning.


For some implementations of CART, we say “essentially” model-free since a
model-based statistic is sometimes chosen as a splitting criterion. The essence
of recursive partitioning is as follows.

1. Find the predictor so that the best possible binary split on that predictor
has a larger value of some statistical criterion than any other split on any
other predictor. For ordinal and continuous predictors, the split is of the
form X < c versus X ≥ c. For polytomous predictors, the split involves
finding the best separation of categories, without preserving order.

2.  Within  each  previously  formed  subset,  find  the  best  predictor  and  best 
split  that  maximizes  the  criterion  in  the  subset  of  observations  passing  the 
previous  split. 

3.  Proceed  in  like  fashion  until  fewer  than  k  observations  remain  to  be  split, 
where  k  is  typically  20  to  100. 

4.  Obtain  predicted  values  using  a  statistic  that  summarizes  each  terminal 
node  (e.g.,  mean  or  proportion). 

5. Prune the tree backward so that a tree with the same number of nodes
developed on 0.9 of the data validates best on the remaining 0.1 of the
data (averaging over the 10 cross-validations). Alternatively, shrink the node
estimates toward the mean, using a progressively stronger shrinkage factor,
until the best cross-validated performance is obtained.
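
Step 1 of the list above can be sketched as follows for continuous
predictors and a continuous response. The splitting criterion used here
(the total within-node sum of squared errors, to be minimized) is one
possible choice assumed for the example, and the function names and the
small dataset are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Best cutpoint c for the split x < c versus x >= c.

    Candidate cutpoints are midpoints between adjacent distinct x values.
    Returns (c, total within-node SSE); c is None if x has a single value.
    """
    best = (None, float("inf"))
    distinct = sorted(set(x))
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < c]
        right = [yi for xi, yi in zip(x, y) if xi >= c]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (c, total)
    return best


def best_predictor(columns, y):
    """Search every predictor and every split; return the overall winner."""
    results = {name: best_split(col, y) for name, col in columns.items()}
    name = min(results, key=lambda key: results[key][1])
    return name, results[name]


x1 = [1, 2, 3, 4, 5, 6]
x2 = [3, 1, 4, 1, 5, 9]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
name, (cut, total) = best_predictor({"x1": x1, "x2": x2}, y)
```

Steps 2 and 3 would apply best_predictor again within each of the two
resulting subsets. The search over all predictors and all cutpoints is the
source of the overfitting discussed in the following paragraphs.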

Tree models have the advantage of not requiring any functional form for
the predictors and of not assuming additivity of predictors (i.e., recursive
partitioning can identify complex interactions). Trees can deal with
missing data flexibly. They have the disadvantages of not utilizing continuous
variables effectively and of overfitting in three directions: searching for best
predictors, for best splits, and searching multiple times. The penalty for the
extreme amount of data searching required by recursive partitioning surfaces
when the tree does not cross-validate optimally until it is pruned all the way
back to two or three splits. Thus reliable trees are often not very
discriminating.

Tree models are especially useful in messy situations or settings in which
overfitting is not so problematic, such as confounder adjustment using
propensity scores11^ or in missing value imputation. A major advantage of tree
modeling is savings of analyst time, but this is offset by the underfitting
needed to make trees validate.


2.6  Multiple  Degree  of  Freedom  Tests  of  Association 

When a factor is a linear or binary term in the regression model, the test
of association for that factor with the response involves testing only a single
regression parameter. Nominal factors and predictors that are represented as
a quadratic or spline function require multiple regression parameters to be
tested simultaneously in order to assess association with the response. For a
nominal factor having k levels, the overall ANOVA-type test with k - 1 d.f.
tests whether there are any differences in responses between the k categories.
It is recommended that this test be done before attempting to interpret
individual parameter estimates. If the overall test is not significant, it can be
dangerous to rely on individual pairwise comparisons because the type I error
will be increased. Likewise, for a continuous predictor for which linearity is
not assumed, all terms involving the predictor should be tested
simultaneously to check whether the factor is associated with the outcome. This
test should precede the test for linearity and should usually precede the
attempt to eliminate nonlinear terms. For example, in the model

    C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2          (2.30)

one should test H_0: \beta_2 = \beta_3 = 0 with 2 d.f. to assess association
between X_2 and outcome. In the five-knot restricted cubic spline model

    C(Y|X) = \beta_0 + \beta_1 X + \beta_2 X' + \beta_3 X'' + \beta_4 X''',   (2.31)

the hypothesis H_0: \beta_1 = \ldots = \beta_4 = 0 should be tested with 4 d.f.
to assess whether there is any association between X and Y. If this 4 d.f. test is
insignificant, it is dangerous to interpret the shape of the fitted spline function
because the hypothesis that the overall function is flat has not been rejected.

A dilemma arises when an overall test of association, say one having 4
d.f., is insignificant, the 3 d.f. test for linearity is insignificant, but the 1 d.f.
test for linear association, after deleting nonlinear terms, becomes significant.
Had the test for linearity been borderline significant, it would not have been
warranted to drop these terms in order to test for a linear association. But
with the evidence for nonlinearity not very great, one could attempt to test
for association with 1 d.f. This however is not fully justified, because the 1
d.f. test statistic does not have a \chi^2 distribution with 1 d.f. since pretesting
was done. The original 4 d.f. test statistic does have a \chi^2 distribution with 4
d.f. because it was for a pre-specified test.

For quadratic regression, Grambsch and O’Brien [234] showed that the 2
d.f. test of association is nearly optimal when pretesting is done, even when
the true relationship is linear. They considered an ordinary regression model
E(Y|X) = \beta_0 + \beta_1 X + \beta_2 X^2 and studied tests of association
between X and Y. The strategy they studied was as follows. First, fit the
quadratic model and obtain the partial test of H_0: \beta_2 = 0, that is, the test
of linearity. If this partial F-test is significant at the \alpha = 0.05 level,
report as the final test of association between X and Y the 2 d.f. F-test of
H_0: \beta_1 = \beta_2 = 0. If the test of linearity is insignificant, the model
is refitted without the quadratic term and the test of association is then a
1 d.f. test, H_0: \beta_1 = 0 | \beta_2 = 0.
Grambsch and O’Brien demonstrated that the type I error from this
two-stage test is greater than the stated \alpha, and in fact a fairly accurate
P-value can be obtained if it is computed from an F distribution with 2 numerator
d.f. even when testing at the second stage. This is because in the original
2 d.f. test of association, the 1 d.f. corresponding to the nonlinear effect is
deleted if the nonlinear effect is very small; that is, one is retaining the most
significant part of the 2 d.f. F statistic.

If we use a 2 d.f. F critical value to assess the X effect even when X^2 is not
in the model, it is clear that the two-stage approach can only lose power and
hence it has no advantage whatsoever. That is because the sum of squares
due to regression from the quadratic model is never smaller than the sum of
squares computed from the linear model.
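
The last statement can be checked numerically. The following sketch fits
the linear and quadratic models by ordinary least squares and compares
their regression sums of squares. The data values are made up for
illustration, and the helper functions solve and regression_ss are
hypothetical names written for this example rather than library routines.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def regression_ss(columns, y):
    """Regression sum of squares for a least squares fit with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    ybar = sum(y) / n
    return sum((f - ybar) ** 2 for f in fitted)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.4, 5.8, 6.1, 7.4, 7.7]      # illustrative values only
ss_linear = regression_ss([x], y)
ss_quadratic = regression_ss([x, [v * v for v in x]], y)
print(ss_linear, ss_quadratic)
```

Since the quadratic model contains the linear model as a special case, its
regression sum of squares cannot be smaller, whatever data are used.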


2.7  Assessment  of  Model  Fit 
2.7.1  Regression  Assumptions 

In  this  section,  the  regression  part  of  the  model  is  isolated,  and  methods  are 
described  for  validating  the  regression  assumptions  or  modifying  the  model 
to  meet  the  assumptions.  The  general  linear  regression  model  is 


    C(Y|X) = X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k.   (2.32)

The assumptions of linearity and additivity need to be verified. We begin
with a special case of the general model,

    C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2,                              (2.33)

where X_1 is binary and X_2 is continuous. One needs to verify that the
property of the response C(Y) is related to X_1 and X_2 according to Figure 2.4.

There  are  several  methods  for  checking  the  fit  of  this  model.  The  first 
method  below  is  based  on  critiquing  the  simple  model,  and  the  other  methods 
directly  “estimate”  the  model. 

Fig. 2.4 Regression assumptions for one binary and one continuous predictor
(horizontal axis: X_2)

1. Fit the simple linear additive model and critically examine residual plots
for evidence of systematic patterns. For least squares fits one can compute
estimated residuals \hat{e} = Y - X\hat{\beta} and box plots of \hat{e}
stratified by X_1 and scatterplots of \hat{e} versus X_2 and \hat{Y} with trend
curves. If one is assuming constant conditional variance of Y, the spread of
the residual distribution against each of the variables can be checked at the
same time. If the normality assumption is needed (i.e., if significance tests
or confidence limits are used), the distribution of \hat{e} can be compared
with a normal distribution with mean zero. Advantage: Simplicity.
Disadvantages: Standard residuals can only be computed for continuous
uncensored response variables. The judgment of non-randomness is largely
subjective, it is difficult to detect interaction, and if interaction is present
it is difficult to check any of the other assumptions. Unless trend lines are
added to plots, patterns may be difficult to discern if the sample size is very
large. Detecting patterns in residuals does not always inform the analyst of
what corrective action to take, although partial residual plots can be used to
estimate the needed transformations if interaction is absent.

2. Make a scatterplot of Y versus X_2 using different symbols according to
values of X_1. Advantages: Simplicity, and one can sometimes see all
regression patterns including interaction. Disadvantages: Scatterplots
cannot be drawn for binary, categorical, or censored Y. Patterns are difficult
to see if relationships are weak or if the sample size is very large.

3. Stratify the sample by X_1 and quantile groups (e.g., deciles) of X_2. Within
each X_1 × X_2 stratum an estimate of C(Y|X_1, X_2) is computed. If X_1 is
continuous, the same method can be used after grouping X_1 into quantile
groups. Advantages: Simplicity, ability to see interaction patterns, can
handle censored Y if care is taken. Disadvantages: Subgrouping requires
relatively large sample sizes and does not use continuous factors effectively
as it does not attempt any interpolation. The ordering of quantile groups is
not utilized by the procedure. Subgroup estimates have low precision (see
p. 488 for an example). Each stratum must contain enough information
to allow trends to be apparent above noise in the data. The method of
grouping chosen (e.g., deciles vs. quintiles vs. rounding) can alter the shape
of the plot.

4. Fit a nonparametric smoother separately for levels of X_1 (Section 2.4.7)
relating X_2 to Y. Advantages: All regression aspects of the model can
be summarized efficiently with minimal assumptions. Disadvantages:
Does not easily apply to censored Y, and does not easily handle multiple
predictors.

5. Fit a flexible parametric model that allows for most of the departures from
the linear additive model that you wish to entertain. Advantages: One
framework is used for examining the model assumptions, fitting the model,
and drawing formal inference. Degrees of freedom are well defined and
all aspects of statistical inference “work as advertised.” Disadvantages:
Complexity, and it is generally difficult to allow for interactions when
assessing patterns of effects.

The first four methods each have the disadvantage that if confidence limits
or formal inferences are desired it is difficult to know how many degrees of
freedom were effectively used so that, for example, confidence limits will have
the stated coverage probability. For method five, the restricted cubic spline
function is an excellent tool for estimating the true relationship between X_2
and C(Y) for continuous variables without assuming linearity. By fitting a
model containing X_2 expanded into k - 1 terms, where k is the number of
knots, one can obtain an estimate of the function of X_2 that could be used
linearly in the model:


    \hat{C}(Y|X) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2
                   + \hat{\beta}_3 X_2' + \hat{\beta}_4 X_2''
                 = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{f}(X_2),         (2.34)

where

    \hat{f}(X_2) = \hat{\beta}_2 X_2 + \hat{\beta}_3 X_2' + \hat{\beta}_4 X_2'',   (2.35)

and X_2' and X_2'' are constructed spline variables (when k = 4) as described
previously. We call \hat{f}(X_2) the spline-estimated transformation of X_2.
Plotting the estimated spline function \hat{f}(X_2) versus X_2 will generally
shed light on how the effect of X_2 should be modeled. If the sample is
sufficiently large, the spline function can be fitted separately for X_1 = 0 and
X_1 = 1, allowing detection of even unusual interaction patterns. A formal test
of linearity in X_2 is obtained by testing H_0: \beta_3 = \beta_4 = 0, using a
computationally efficient score test, for example (Section 9.2.3).
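
The constructed spline variables can be computed as in the sketch below.
The truncated power form used here is the usual restricted cubic spline
basis, stated as an assumption because the defining formula lies outside
this section; the function name rcs_basis and the knot locations are
hypothetical, and no rescaling of the nonlinear terms is applied.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline terms [X, X', X'', ...] for k knots.

    Returns the linear term followed by k - 2 nonlinear terms, so a
    predictor is expanded into k - 1 columns in total.
    """
    t = sorted(knots)
    k = len(t)

    def pos3(u):
        """Cube of the positive part of u."""
        return u ** 3 if u > 0 else 0.0

    d = t[-1] - t[-2]
    terms = [float(x)]
    for j in range(k - 2):
        terms.append(
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / d
            + pos3(x - t[-1]) * (t[-2] - t[j]) / d
        )
    return terms


knots = [1, 3, 5, 8]                      # k = 4 knots, chosen arbitrarily
rows = [rcs_basis(x, knots) for x in (0.5, 2, 4, 6, 10)]
for row in rows:
    print([round(v, 3) for v in row])     # columns X2, X2', X2''
```

Each row could then be supplied to any regression program as the three
columns X_2, X_2', and X_2''. The nonlinear terms are zero below the first
knot and linear in X beyond the last knot, which is the restriction that
gives the function its name.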

If the model is nonlinear in X_2, either a transformation suggested by the
spline function plot (e.g., log(X_2)) or the spline function itself (by placing
X_2, X_2', and X_2'' simultaneously in any model fitted) may be used to describe
X_2 in the model. If a tentative transformation of X_2 is specified, say g(X_2),
the adequacy of this transformation can be tested by expanding g(X_2) in a
spline function and testing for linearity. If one is concerned only with
prediction and not with statistical inference, one can attempt to find a
simplifying transformation for a predictor by plotting g(X_2) against
\hat{f}(X_2) (the estimated spline transformation) for a variety of g, seeking a
linearizing transformation of X_2. When there are nominal or binary predictors
in the model in addition to the continuous predictors, it should be noted that
there are no shape assumptions to verify for the binary/nominal predictors. One
need only test for interactions between these predictors and the others.

If the model contains more than one continuous predictor, all may be
expanded with spline functions in order to test linearity or to describe
nonlinear relationships. If one did desire to assess simultaneously, for example,
the linearity of predictors X_2 and X_3 in the presence of a linear or binary
predictor X_1, the model could be specified as

    C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2' + \beta_4 X_2''
             + \beta_5 X_3 + \beta_6 X_3' + \beta_7 X_3'',                  (2.36)

where X_2', X_2'', X_3', and X_3'' represent components of four knot restricted
cubic spline functions.

The test of linearity for X_2 (with 2 d.f.) is H_0: \beta_3 = \beta_4 = 0. The
overall test of linearity for X_2 and X_3 is
H_0: \beta_3 = \beta_4 = \beta_6 = \beta_7 = 0, with 4 d.f.
But as described further in Section 4.1, even though there are many reasons
for allowing relationships to be nonlinear, there are reasons for not testing
the nonlinear components for significance, as this might tempt the analyst to
simplify the model thus distorting inference [234]. Testing for linearity is usually
best done to justify to non-statisticians the need for complexity to explain or
predict outcomes.


2.7.2 Modeling and Testing Complex Interactions

For testing interaction between X_1 and X_2 (after a needed transformation
may have been applied), often a product term (e.g., X_1 X_2) can be added
to the model and its coefficient tested. A more general simultaneous test of
linearity and lack of interaction for a two-variable model in which one variable
is binary (or is assumed linear) is obtained by fitting the model

    C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2' + \beta_4 X_2''
             + \beta_5 X_1 X_2 + \beta_6 X_1 X_2' + \beta_7 X_1 X_2''        (2.37)

and testing H_0: \beta_3 = \ldots = \beta_7 = 0. This formulation allows the
shape of the X_2 effect to be completely different for each level of X_1. There
is virtually no departure from linearity and additivity that cannot be detected
from this expanded model formulation if the number of knots is adequate and X_1
is binary. For binary logistic models, this method is equivalent to fitting two
separate spline regressions in X_2.

Interactions can be complex when all variables are continuous. An
approximate approach is to reduce the variables to two transformed variables,
in which case interaction may sometimes be approximated by a single product
of the two new variables. A disadvantage of this approach is that the
estimates of the transformations for the two variables will be different depending
on whether interaction terms are adjusted for when estimating “main effects.”
A good compromise method involves fitting interactions of the form X_1 f(X_2)
and X_2 g(X_1):

    C(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_1' + \beta_3 X_1''
             + \beta_4 X_2 + \beta_5 X_2' + \beta_6 X_2''
             + \beta_7 X_1 X_2 + \beta_8 X_1 X_2' + \beta_9 X_1 X_2''
             + \beta_{10} X_2 X_1' + \beta_{11} X_2 X_1''                   (2.38)

(for k = 4 knots for both variables). The test of additivity is
H_0: \beta_7 = \beta_8 = \ldots = \beta_{11} = 0 with 5 d.f. A test of lack of
fit for the simple product interaction with X_2 is H_0: \beta_8 = \beta_9 = 0,
and a test of lack of fit for the simple product interaction with X_1 is
H_0: \beta_{10} = \beta_{11} = 0.

A general way to model and test interactions, although one requiring a
larger number of parameters to be estimated, is based on modeling the
X_1 × X_2 × Y relationship with a smooth three-dimensional surface. A cubic
spline surface can be constructed by covering the X_1-X_2 plane with a grid and
fitting a patch-wise cubic polynomial in two variables. The grid is (u_i, v_j),
i = 1, \ldots, k, j = 1, \ldots, k, where knots for X_1 are (u_1, \ldots, u_k)
and knots for X_2 are (v_1, \ldots, v_k). The number of parameters can be
reduced by constraining the surface to be of the form a X_1 + b X_2 + c X_1 X_2
in the lower left and upper right corners of the plane. The resulting restricted
cubic spline surface is described by a multiple regression model containing
spline expansions in X_1 and X_2 and all cross-products of the restricted cubic
spline components (e.g., X_1 X_2'). If the same number of knots k is used for
both predictors, the number of interaction terms is (k - 1)^2. Examples of
various ways of modeling interaction are given in Chapter 10. Spline functions
made up of cross-products of all terms of individual spline functions are called
tensor splines [50, 274].
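
The difference in the number of interaction parameters between the
restricted form of Equation 2.38 and the full tensor spline can be made
concrete by listing the product terms. This is only a bookkeeping sketch;
the function names and the use of text labels such as X1' for the spline
components are conventions assumed for the example.

```python
def spline_terms(name, k):
    """Labels of the k - 1 restricted cubic spline terms: X, X', X'', ..."""
    return [name + "'" * i for i in range(k - 1)]


def restricted_interactions(k):
    """Products of the form X1*f(X2) and X2*g(X1), as in Equation 2.38."""
    x1, x2 = spline_terms("X1", k), spline_terms("X2", k)
    terms = [f"{x1[0]}*{b}" for b in x2]         # X1 times every term of X2
    terms += [f"{a}*{x2[0]}" for a in x1[1:]]    # nonlinear X1 terms times X2
    return terms


def tensor_interactions(k):
    """All cross-products of the two spline expansions (tensor spline)."""
    x1, x2 = spline_terms("X1", k), spline_terms("X2", k)
    return [f"{a}*{b}" for a in x1 for b in x2]


k = 4
restricted = restricted_interactions(k)
tensor = tensor_interactions(k)
print(len(restricted), restricted)
print(len(tensor), tensor)
```

With k = 4 the restricted form has 5 interaction terms, matching the 5 d.f.
test of additivity, whereas the tensor spline has (k - 1)^2 = 9.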

The  presence  of  more  than  two  predictors  increases  the  complexity  of  tests 
for  interactions  because  of  the  number  of  two-way  interactions  and  because 
of  the  possibility  of  interaction  effects  of  order  higher  than  two.  For  example, 
in  a  model  containing  age,  sex,  and  diabetes,  the  important  interaction  could 
be  that  older  male  diabetics  have  an  exaggerated  risk.  However,  higher-order 
interactions  are  often  ignored  unless  specified  a  priori  based  on  knowledge  of 
the  subject  matter.  Indeed,  the  number  of  two-way  interactions  alone  is  often 
too  large  to  allow  testing  them  all  with  reasonable  power  while  controlling 
multiple  comparison  problems.  Often,  the  only  two-way  interactions  we  can 
afford  to  test  are  those  that  were  thought  to  be  important  before  examining 
the  data.  A  good  approach  is  to  test  for  all  such  pre-specified  interaction 
effects with a single global (pooled) test. Then, unless interactions involving
only one of the predictors are of special interest, one can either drop all
interactions or retain all of them.

For some problems a reasonable approach is, for each predictor separately,
to test simultaneously the joint importance of all interactions involving that
predictor. For p predictors this results in p tests each with p - 1 degrees
of freedom. The multiple comparison problem would then be reduced from
p(p - 1)/2 tests (if all two-way interactions were tested individually) to p
tests.

In  the  fields  of  biostatistics  and  epidemiology,  some  types  of  interactions 
that  have  consistently  been  found  to  be  important  in  predicting  outcomes 
and  thus  may  be  pre-specified  are  the  following. 

1.  Interactions  between  treatment  and  the  severity  of  disease  being  treated. 
Patients  with  little  disease  can  receive  little  benefit. 

2.  Interactions  involving  age  and  risk  factors.  Older  subjects  are  generally 
less  affected  by  risk  factors.  They  had  to  have  been  robust  to  survive  to 
their  current  age  with  risk  factors  present. 

3.  Interactions  involving  age  and  type  of  disease.  Some  diseases  are  incurable 
and  have  the  same  prognosis  regardless  of  age.  Others  are  treatable  or 
have  less  effect  on  younger  patients. 

4. Interactions between a measurement and the state of a subject during a
measurement. Respiration rate measured during sleep may have greater
predictive value and thus have a steeper slope versus outcome than
respiration rate measured during activity.

5.  Interaction  between  menopausal  status  and  treatment  or  risk  factors. 

6.  Interactions  between  race  and  disease. 

7.  Interactions  between  calendar  time  and  treatment.  Some  treatments  have 
learning  curves  causing  secular  trends  in  the  associations. 

8. Interactions between month of the year and other predictors, due to
seasonal effects.

9. Interaction between the quality and quantity of a symptom, for example,
daily frequency of chest pain × severity of a typical pain episode.

10.  Interactions  between  study  center  and  treatment. 


2.7.3  Fitting  Ordinal  Predictors 

For the case of an ordinal predictor, spline functions are not useful unless
there are so many categories that in essence the variable is continuous. When
the number of categories k is small (three to five, say), the variable is
usually modeled as a polytomous factor using indicator variables or equivalently
as one linear term and k - 2 indicators. The latter coding facilitates testing
for linearity. For more categories, it may be reasonable to stratify the data
by levels of the variable and to compute summary statistics (e.g., logit
proportions for a logistic model) or to examine regression coefficients associated
with indicator variables over categories. Then one can attempt to summarize
the pattern with a linear or some other simple trend. Later hypothesis tests
must take into account this data-driven scoring (by using > 1 d.f., for
example), but the scoring can save degrees of freedom when testing for interaction
with other factors. In one dataset, the number of comorbid diseases was used
to summarize the risk of a set of diseases that was too large to model. By
plotting the logit of the proportion of deaths versus the number of diseases,
it was clear that the square of the number of diseases would properly score
the variables.

Sometimes it is useful to code an ordinal predictor with k - 1 indicator
variables of the form [X ≥ v_j], where j = 2, \ldots, k and [h] is 1 if h is true,
0 otherwise [648]. Although a test of linearity does not arise immediately from
this coding, the regression coefficients are interpreted as amounts of change
from the previous category. A test of whether the last m categories can be
combined with the category k - m does follow easily from this coding.
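
The cumulative indicator coding can be illustrated as follows. The level
values and the regression coefficients are invented for the example, and
the function names are hypothetical.

```python
def cumulative_indicators(x, levels):
    """Code x as the k - 1 indicators [x >= v_j], j = 2, ..., k.

    `levels` holds the ordered category values v_1, ..., v_k.
    """
    return [1 if x >= v else 0 for v in levels[1:]]


def linear_predictor(x, levels, coefs):
    """Contribution of the ordinal predictor (excluding the intercept)."""
    return sum(b * z for b, z in zip(coefs, cumulative_indicators(x, levels)))


levels = [1, 2, 3, 4]
coefs = [0.5, 0.2, 1.0]            # made-up coefficients for illustration

for v in levels:
    print(v, cumulative_indicators(v, levels), linear_predictor(v, levels, coefs))
```

Moving from one category to the next switches on exactly one more
indicator, so each coefficient is the change from the previous category.
Testing whether the last m coefficients are all zero is then the test of
whether the last m categories can be combined with category k - m.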


2.7.4  Distributional  Assumptions 

The general linear regression model is stated as C(Y|X) = X\beta to highlight its
regression assumptions. For logistic regression models for binary or nominal
responses, there is no distributional assumption if simple random sampling
is used and subjects’ responses are independent. That is, the binary logistic
model and all of its assumptions are contained in the expression
logit{Y = 1|X} = X\beta. For ordinary multiple regression with constant variance
\sigma^2, we usually assume that Y - X\beta is normally distributed with mean 0
and variance \sigma^2. This assumption can be checked by estimating \beta with
\hat{\beta} and plotting the overall distribution of the residuals
Y - X\hat{\beta}, the residuals against \hat{Y}, and the residuals against each
X. For the latter two, the residuals should be normally distributed within each
neighborhood of \hat{Y} or X. A weaker requirement is that
the overall distribution of residuals is normal; this will be satisfied if all of the
stratified residual distributions are normal. Note a hidden assumption in both
models, namely, that there are no omitted predictors. Other models, such as
the Weibull survival model or the Cox [132] proportional hazards model, also
have distributional assumptions that are not fully specified by C(Y|X) = X\beta.
However, regression and distributional assumptions of some of these models
are encapsulated by

    C(Y|X) = C(Y = y|X) = d(y) + X\beta                                    (2.39)

for some choice of $C$. Here $C(Y = y \mid X)$ is a property of the response $Y$
evaluated at $Y = y$, given the predictor values $X$, and $d(y)$ is a component of
the distribution of $Y$. For the Cox proportional hazards model, $C(Y = y \mid X)$
can be written as the log of the hazard of the event at time $y$, or equivalently
as the log of the $-\log$ of the survival probability at time $y$, and $d(y)$ can be
thought of as a log hazard function for a “standard” subject.

If we evaluated the property $C(Y = y \mid X)$ at predictor values $X^1$ and $X^2$,
the difference in properties is

$$C(Y = y \mid X^1) - C(Y = y \mid X^2) = d(y) + X^1\beta - [d(y) + X^2\beta] = (X^1 - X^2)\beta, \qquad (2.40)$$


which is independent of $y$. One way to verify part of the distributional
assumption is to estimate $C(Y = y \mid X^1)$ and $C(Y = y \mid X^2)$ for set values of
$X^1$ and $X^2$ using a method that does not make the assumption, and to plot
$C(Y = y \mid X^1) - C(Y = y \mid X^2)$ versus $y$. This function should be flat if the
distributional assumption holds. The assumption can be tested formally if
$d(y)$ can be generalized to be a function of $X$ as well as $y$. A test of whether
$d(y \mid X)$ depends on $X$ is a test of one part of the distributional assumption.
For example, writing $d(y \mid X) = d(y) + X\Gamma \log(y)$ where

$$X\Gamma = \Gamma_1 X_1 + \Gamma_2 X_2 + \ldots + \Gamma_k X_k \qquad (2.41)$$

and testing $H_0: \Gamma_1 = \ldots = \Gamma_k = 0$ is one way to test whether $d(y \mid X)$
depends on $X$. For semiparametric models such as the Cox proportional hazards
model, the only distributional assumption is the one stated above, namely,
that the difference in properties between two subjects depends only on the
difference in the predictors between the two subjects. Other, parametric, models
assume in addition that the property $C(Y = y \mid X)$ has a specific shape as a
function of $y$, that is, that $d(y)$ has a specific functional form. For example,
the Weibull survival model has a specific assumption regarding the shape of
the hazard or survival distribution as a function of $y$.
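
The graphical check described above (the difference in properties should be flat in y) can be sketched with simulated data. Everything in this example is an assumption made for illustration: the Weibull shape, the log hazard ratio of 0.7, the absence of censoring, and the function names. The property used is the log of the -log of the survival probability, estimated separately in each group by the simple proportion surviving, which does not impose proportional hazards.

```python
import math
import random

random.seed(1)


def simulate_times(n, xb, shape=1.5):
    """Draw n times with survival S(y) = exp(-exp(xb) * y**shape) by inversion."""
    return [(-math.log(1.0 - random.random()) / math.exp(xb)) ** (1.0 / shape)
            for _ in range(n)]


def log_minus_log_survival(times, y):
    """Empirical log(-log S(y)); no censoring, so S(y) is a simple proportion."""
    surv = sum(t > y for t in times) / len(times)
    return math.log(-math.log(surv))


beta = 0.7                                   # assumed log hazard ratio
group1 = simulate_times(20000, xb=beta)      # subjects with X^1 = 1
group2 = simulate_times(20000, xb=0.0)       # subjects with X^2 = 0

grid = [0.4, 0.7, 1.0]
diffs = [log_minus_log_survival(group1, y) - log_minus_log_survival(group2, y)
         for y in grid]
for y, d in zip(grid, diffs):
    print(f"y = {y:.1f}: estimated difference in log(-log S) = {d:.3f}")
```

Because the data were generated under proportional hazards, the printed differences should all be close to the assumed value of 0.7 regardless of y. A clear trend in y would point to a violation of this part of the distributional assumption.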

Assessments of distributional assumptions are best understood by applying
these methods to individual models as is demonstrated in later chapters.


2.8 Further Reading

References [152, 575, 578] have more information about cubic splines.

See Smith [578] for a good overview of spline functions.

More material about natural splines may be found in de Boor [152]. McNeil et al. [451]
discuss the overall smoothness of natural splines in terms of the integral of the
square of the second derivative of the regression function, over the range of
the data. Govindarajulu et al. [230] compared restricted cubic splines, penalized
splines, and fractional polynomial [532] fits and found that the first two methods
agreed with each other more than with estimated fractional polynomials.

A tutorial on restricted cubic splines is in [271].

Durrleman and Simon [168] provide examples in which knots are allowed to be
estimated as free parameters, jointly with the regression coefficients. They found
that even though the “optimal” knots were often far from a priori knot locations,
the model fits were virtually identical.


Contrast Hastie and Tibshirani’s generalized nonparametric additive models [275]
with Stone and Koo’s [595] additive model in which each continuous predictor is
represented with a restricted cubic spline function.

Gray [237, 238] provided some comparisons with ordinary regression splines, but he
compared penalized regression splines with non-restricted splines with only two
knots. Two knots were chosen so as to limit the degrees of freedom needed by the
regression spline method to a reasonable number. Gray argued that regression
splines are sensitive to knot locations, and he is correct when only two knots
are allowed and no linear tail restrictions are imposed. Two knots also prevent
the (ordinary maximum likelihood) fit from utilizing some local behavior of
the regression relationship. For penalized likelihood estimation using B-splines,
Gray [238] provided extensive simulation studies of type I and II error for testing
association in which the true regression function, number of knots, and amount
of likelihood penalization were varied. He studied both normal regression and
Cox regression.

Breiman et al.’s original CART method [69] used the Gini criterion for splitting.
Later work has used log-likelihoods [109]. Segal [562], LeBlanc and Crowley [389],
Ciampi et al. [107, 108], and Keleş and Segal [342] have extended recursive partitioning
to censored survival data using the log-rank statistic as the criterion. Zhang [682]
extended tree models to handle multivariate binary responses. Schmoor et al. [556]
used a more general splitting criterion that is useful in therapeutic trials, namely,
a Cox test for main and interacting effects. Davis and Anderson [149] used an
exponential  survival  model  as  the  basis  for  tree  construction.  Ahn  and  Loh' 
developed a Cox proportional hazards model adaptation of recursive partitioning
along with bootstrap and cross-validation-based methods to protect against
“over-splitting.”  The  Cox-based  regression  tree  methods  of  Ciampi  et  al.10'  have 
a  unique  feature  that  allows  for  construction  of  “treatment  interaction  trees” 
with hierarchical adjustment for baseline variables. Zhang et al. [683] provided a
new method for handling missing predictor values that is simpler than using
surrogate splits. See [34, 140, 270, 629] for examples using recursive partitioning
for binary responses in which the prediction trees did not validate well.

References [443, 629] discuss other problems with tree models.

For ordinary linear models, the regression estimates are the same as obtained
with separate fits, but standard errors are different (since a pooled standard
error is used for the combined fit). For Cox [132] regression, separate fits can be
slightly different since each subset would use a separate ranking of Y.

Gray’s penalized fixed-knot regression splines can be useful for estimating joint
effects of two continuous variables while allowing the analyst to control the
effective number of degrees of freedom in the fit [237, 238, Section 3.2]. When
Y is a non-censored variable, the local regression model of Cleveland et al. [96],
a multidimensional scatterplot smoother mentioned in Section 2.4.7, provides a
good graphical assessment of the joint effects of several predictors so that the
forms of interactions can be chosen. See Wang et al. [653] and Gustafson [248] for
several other flexible approaches to analyzing interactions among continuous
variables.

Study site by treatment interaction is often the interaction that is worried about
the most in multi-center randomized clinical trials, because regulatory agencies
are concerned with consistency of treatment effects over study centers. However,
this type of interaction is usually the weakest and is difficult to assess when
there are many centers due to the number of interaction parameters to estimate.
Schemper [545] discusses various types of interactions and a general nonparametric
test for interaction.


2.9 Problems

For  problems  1  to  3,  state  each  model  statistically,  identifying  each  predictor 
with  one  or  more  component  variables.  Identify  and  interpret  each  regression 
parameter  except  for  coefficients  of  nonlinear  terms  in  spline  functions.  State 
each  hypothesis  below  as  a  formal  statistical  hypothesis  involving  the  proper 
parameters,  and  give  the  (numerator)  degrees  of  freedom  of  the  test.  State 
alternative  hypotheses  carefully  with  respect  to  unions  or  intersections  of 
conditions  and  list  the  type  of  alternatives  to  the  null  hypothesis  that  the 
test is designed to detect (see footnote c).

1. A property of Y such as the mean is linear in age and blood pressure
and there may be an interaction between the two predictors. Test $H_0$:
there is no interaction between age and blood pressure. Also test $H_0$:
blood pressure is not associated with Y (in any fashion). State the effect
of blood pressure as a function of age, and the effect of age as a function
of blood pressure.

2. Consider a linear additive model involving three treatments (control, drug
Z, and drug Q) and one continuous adjustment variable, age. Test $H_0$:
treatment group is not associated with response, adjusted for age. Also
test $H_0$: response for drug Z has the same property as the response for
drug Q, adjusted for age.

3.  Consider  models  each  with  two  predictors,  temperature  and  white  blood 
count  (WBC),  for  which  temperature  is  always  assumed  to  be  linearly 
related  to  the  appropriate  property  of  the  response,  and  WBC  may  or 
may  not  be  linear  (depending  on  the  particular  model  you  formulate  for 
each  question).  Test: 

a. $H_0$: WBC is not associated with the response versus $H_a$: WBC is
linearly associated with the property of the response.

b. $H_0$: WBC is not associated with Y versus $H_a$: WBC is quadratically
associated with Y. Also write down the formal test of linearity against
this quadratic alternative.

c. $H_0$: WBC is not associated with Y versus $H_a$: WBC is related to the
property of the response through a smooth spline function; for example,
for WBC the model requires the variables WBC, WBC′, and WBC″
where WBC′ and WBC″ represent nonlinear components (if there are
four knots in a restricted cubic spline function). Also write down the
formal test of linearity against this spline function alternative.

d.  Test  for  a  lack  of  fit  (combined  nonlinearity  or  non-additivity)  in  an 
overall  model  that  takes  the  form  of  an  interaction  between  temperature 
and  WBC,  allowing  WBC  to  be  modeled  with  a  smooth  spline  function. 

4. For a fitted model $Y = a + bX + cX^2$ derive the estimate of the effect on
Y of changing X from $x_1$ to $x_2$.

c  In other words, under what assumptions does the test have maximum power?

5. In “The Class of 1988: A Statistical Portrait,” the College Board reported
mean  SAT  scores  for  each  state.  Use  an  ordinary  least  squares  multiple 
regression  model  to  study  the  mean  verbal  SAT  score  as  a  function  of  the 
percentage  of  students  taking  the  test  in  each  state.  Provide  plots  of  fitted 
functions  and  defend  your  choice  of  the  “best”  fit.  Make  sure  the  shape  of 
the  chosen  fit  agrees  with  what  you  know  about  the  variables.  Add  the 
raw  data  points  to  plots. 

a.  Fit  a  linear  spline  function  with  a  knot  at  X  =  50%.  Plot  the  data 
and  the  fitted  function  and  do  a  formal  test  for  linearity  and  a  test 
for  association  between  X  and  Y .  Give  a  detailed  interpretation  of  the 
estimated  coefficients  in  the  linear  spline  model,  and  use  the  partial 
t-test  to  test  linearity  in  this  model. 

b. Fit a restricted cubic spline function with knots at X = 6, 12, 58, and
68% (not percentile) (see footnote d). Plot the fitted function and do a formal test of
association between X and Y. Do two tests of linearity that test the
same hypothesis:

i. by using a contrast to simultaneously test the correct set of coefficients
against zero (done by the anova function in rms) (see footnote e);

ii. by comparing the $R^2$ from the complex model with that from a simple
linear model using a partial F-test.

Explain  why  the  tests  of  linearity  have  the  d.f.  they  have. 

c.  Using  subject  matter  knowledge,  pick  a  final  model  (from  among  the 
previous  models  or  using  another  one)  that  makes  sense. 

The  data  are  found  in  Table  2.4  and  may  be  created  in  R  using  the  sat.r 
code  on  the  RMS  course  web  site. 

6.  Derive  the  formulas  for  the  restricted  cubic  spline  component  variables 
without  cubing  or  squaring  any  terms. 

7. Prove that each component variable is linear in X when $X \geq t_k$, the
last knot, using general principles and not algebra or calculus. Derive an
expression for the restricted spline regression function when $X \geq t_k$.

8. Consider a two-stage procedure in which one tests for linearity of the effect
of a predictor X on a property of the response $C(Y \mid X)$ against a quadratic
alternative. If the two-tailed test of linearity is significant at the $\alpha$ level,
a two d.f. test of association between X and Y is done. If the test for
linearity is not significant, the square term is dropped and a linear model
is fitted. The test of association between X and Y is then (apparently) a
one d.f. test.

a.  Write  a  formal  expression  for  the  test  statistic  for  association. 


b. Write an expression for the nominal P-value for testing association
using this strategy.

c. Write an expression for the actual P-value or alternatively for the type-I
error if using a fixed critical value for the test of association.

d. For the same two-stage strategy consider an estimate of the effect on
$C(Y \mid X)$ of increasing X from a to b. Write a brief symbolic algorithm
for deriving a true two-sided $1 - \alpha$ confidence interval for the b : a effect
(difference in $C(Y)$) using the bootstrap.


d  Note: To pre-specify knots for restricted cubic spline functions, use something like
rcs(predictor, c(t1, t2, t3, t4)), where the knot locations are t1, t2, t3, t4.

e  Note that anova in rms computes all needed test statistics from a single model fit
object.


Table 2.4  SAT data from the College Board, 1988

% Taking SAT (X)   Mean Verbal Score (Y)   % Taking SAT (X)   Mean Verbal Score (Y)
       4                   482                    24                   440
       5                   498                    29                   460
       5                   513                    37                   448
       6                   498                    43                   441
       6                   511                    44                   424
       7                   479                    45                   417
       9                   480                    49                   422
       9                   483                    50                   441
      10                   475                    52                   408
      10                   476                    55                   412
      10                   487                    57                   400
      10                   494                    58                   401
      12                   474                    59                   430
      12                   478                    60                   433
      13                   457                    62                   433
      13                   485                    63                   404
      14                   451                    63                   424
      14                   471                    63                   430
      14                   473                    64                   431
      16                   467                    64                   437
      17                   470                    68                   446
      18                   464                    69                   424
      20                   471                    72                   420
      22                   455                    73                   432
      23                   452                    81                   436
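
As a starting point for Problem 5, the 50 (X, Y) pairs of Table 2.4 can be entered directly. The text points to R code (sat.r) for this purpose; the following is only a Python transcription of the table, with a simple straight-line least squares fit included as a baseline. The names `sat`, `pct`, `verbal`, and `ols_line` are illustrative choices, not part of the original material.

```python
# Table 2.4 transcribed as (percent taking SAT, mean verbal score) pairs.
sat = [
    (4, 482), (5, 498), (5, 513), (6, 498), (6, 511),
    (7, 479), (9, 480), (9, 483), (10, 475), (10, 476),
    (10, 487), (10, 494), (12, 474), (12, 478), (13, 457),
    (13, 485), (14, 451), (14, 471), (14, 473), (16, 467),
    (17, 470), (18, 464), (20, 471), (22, 455), (23, 452),
    (24, 440), (29, 460), (37, 448), (43, 441), (44, 424),
    (45, 417), (49, 422), (50, 441), (52, 408), (55, 412),
    (57, 400), (58, 401), (59, 430), (60, 433), (62, 433),
    (63, 404), (63, 424), (63, 430), (64, 431), (64, 437),
    (68, 446), (69, 424), (72, 420), (73, 432), (81, 436),
]
pct = [p for p, _ in sat]
verbal = [v for _, v in sat]


def ols_line(x, y):
    """Intercept and slope of the simple least squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


intercept, slope = ols_line(pct, verbal)
print(f"baseline straight line: intercept = {intercept:.1f}, slope = {slope:.3f}")
```

The straight line is only a reference fit; the spline models requested in parts a and b extend the same design with additional constructed variables.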

Chapter  3 

Missing  Data 


3.1  Types  of  Missing  Data 


There  are  missing  data  in  the  majority  of  datasets  one  is  likely  to  encounter. 
Before  discussing  some  of  the  problems  of  analyzing  data  in  which  some 
variables  are  missing  for  some  subjects,  we  define  some  nomenclature. 


Missing  completely  at  random  (MCAR) 

Data are missing for reasons that are unrelated to any characteristics or
responses for the subject, including the value of the missing value, were it to
be known. Examples include missing laboratory measurements because of a
dropped test tube (if it was not dropped because of knowledge of any
measurements), a study that ran out of funds before some subjects could return
for follow-up visits, and a survey in which a subject omitted her response to
a question for reasons unrelated to the response she would have made or to
any other of her characteristics.


Missing  at  random  (MAR) 

Data are not missing at random, but the probability that a value is missing
depends on values of variables that were actually measured. As an example,
consider a survey in which females are less likely to provide their personal
income in general (but the likelihood of responding is independent of their
actual income). If we know the sex of every subject and have income levels
for some of the females, unbiased sex-specific income estimates can be made.
That is because the incomes we do have for some of the females are a random
sample of all females’ incomes. Another way of saying that a variable is MAR
is that given the values of other available variables, subjects having missing
values are only randomly different from other subjects [535]. Or to paraphrase
Greenland and Finkle [242], for MAR the missingness of a covariable cannot
depend on unobserved covariable values; for example, whether a predictor is
observed cannot depend on another predictor when the latter is missing but
it can depend on the latter when it is observed. MAR and MCAR data are
also called ignorable non-responses.


Informative  missing  (IM) 

The tendency for a variable to be missing is a function of data that are not
available, including the case when data tend to be missing if their true values
are systematically higher or lower. An example is when subjects with lower
income levels or very high incomes are less likely to provide their personal
income in an interview. IM is also called nonignorable non-response and missing
not at random (MNAR).

IM  is  the  most  difficult  type  of  missing  data  to  handle.  In  many  cases,  there 
is  no  fix  for  IM  nor  is  there  a  way  to  use  the  data  to  test  for  the  existence  of 
IM.  External  considerations  must  dictate  the  choice  of  missing  data  models, 
and  there  are  few  clues  for  specifying  a  model  under  IM.  MCAR  is  the  easiest 
case  to  handle.  Our  ability  to  correctly  analyze  MAR  data  depends  on  the 
availability  of  other  variables  (the  sex  of  the  subject  in  the  example  above). 
Most  of  the  methods  available  for  dealing  with  missing  data  assume  the  data 
are MAR. Fortunately, even though the MAR assumption is not testable, it
may hold approximately if enough variables are included in the imputation
models [256].
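
The three mechanisms can be contrasted with a small simulation built on the income example. All numbers here are invented for illustration (mean incomes of 50 and 60, a standard deviation of 10, and the specific missingness probabilities), as are the function names. The sketch compares the mean income among females with observed values to the mean among all females under each mechanism.

```python
import random
from statistics import mean

random.seed(2)
n = 40000
subjects = []
for _ in range(n):
    female = random.random() < 0.5
    # Hypothetical income distributions (arbitrary units).
    income = random.gauss(50.0 if female else 60.0, 10.0)
    subjects.append((female, income))


def observed(p_missing):
    """Keep the subjects whose income is observed under a missingness rule."""
    return [(f, inc) for f, inc in subjects if random.random() >= p_missing(f, inc)]


# MCAR: missingness unrelated to anything.
mcar = observed(lambda f, inc: 0.3)
# MAR: missingness depends only on sex, which is always recorded.
mar = observed(lambda f, inc: 0.5 if f else 0.1)
# IM: missingness depends on the unobserved income itself.
im = observed(lambda f, inc: 0.6 if inc > 60.0 else 0.1)


def female_mean(rows):
    return mean(inc for f, inc in rows if f)


def overall_mean(rows):
    return mean(inc for _, inc in rows)


print("true female mean:", round(female_mean(subjects), 2))
for name, rows in [("MCAR", mcar), ("MAR", mar), ("IM", im)]:
    print(name, "female mean:", round(female_mean(rows), 2),
          "overall mean:", round(overall_mean(rows), 2))
```

Under MAR the sex-specific mean stays close to the truth while the unadjusted overall mean drifts, because males are over-represented among responders. Under IM even the sex-specific mean is biased, and nothing in the observed data reveals it.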


3.2  Prelude  to  Modeling 

No matter whether one deletes incomplete cases, carefully imputes (estimates)
missing data, or uses full maximum likelihood or Bayesian techniques
to incorporate partial data, it is beneficial to characterize patterns
of missingness using exploratory data analysis techniques. These techniques
include binary logistic models and recursive partitioning for predicting the
probability that a given variable is missing. Patterns of missingness should be
reported to help readers understand the limitations of incomplete data. If you
do decide to use imputation, it is also important to describe how variables are
simultaneously missing. A cluster analysis of missing value status of all the
variables is useful here. This can uncover cases where imputation is not as
effective. For example, if the only variable moderately related to diastolic blood
pressure is systolic pressure, but both pressures are missing on the same
subjects, systolic pressure cannot be used to estimate diastolic blood pressure. R
functions naclus and naplot in the Hmisc package (see p. 142) can help detect
how variables are simultaneously missing. Recursive partitioning (regression
tree) algorithms (see Section 2.5) are invaluable for describing which kinds of
subjects are missing on a variable. Logistic regression is also an excellent tool
for this purpose. A later example (p. 302) demonstrates these procedures.

It  can  also  be  helpful  to  explore  the  distribution  of  non-missing  Y  by  the 
number  of  missing  variables  in  X  (including  zero,  i.e.,  complete  cases  on  X). 


3.3  Missing  Values  for  Different  Types 
of  Response  Variables 


When  the  response  variable  Y  is  collected  serially  but  some  subjects  drop  out 
of  the  study  before  completion,  there  are  many  ways  of  dealing  with  partial 
information  2,412,480  including  multiple  imputation  in  phases,381  or  efficiently 
analyzing  all  available  serial  data  using  a  full  likelihood  model.  When  Y  is  the 
time  until  an  event,  there  are  actually  no  missing  values  of  Y  but  follow-up 
will  be  curtailed  for  some  subjects.  That  leaves  the  case  where  the  response 
is  completely  measured  once. 

It  is  common  practice  to  discard  subjects  having  missing  Y.  Before  doing 
so,  at  minimum  an  analysis  should  be  done  to  characterize  the  tendency 
for  Y  to  be  missing,  as  just  described.  For  example,  logistic  regression  or 
recursive  partitioning  can  be  used  to  predict  whether  Y  is  missing  and  to 
test  for  systematic  tendencies  as  opposed  to  Y  being  missing  completely  at 
random.  In  many  models,  though,  more  efficient  and  less  biased  estimates  of 
regression  coefficients  can  be  made  by  also  utilizing  observations  missing  on 
Y  that  are  non-missing  on  X.  Hence  there  is  a  definite  place  for  imputation 
of Y. von Hippel found advantages of using all variables to impute all
others, and once imputation is finished, discarding those observations having
missing Y. However, if missing Y values are MCAR, up-front deletion of cases
having  missing  Y  may  sometimes  be  preferred,  as  imputation  requires  correct 
specification  of  the  imputation  model. 


3.4  Problems  with  Simple  Alternatives 
to  Imputation 

Incomplete predictor information is a very common missing data problem.
Statistical software packages use casewise deletion in handling missing
predictors; that is, any subject having any predictor or Y missing will be excluded
from a regression analysis. Casewise deletion results in regression coefficient
estimates that can be terribly biased, imprecise, or both [353]. First consider an
example where bias is the problem. Suppose that the response is death and
the predictors are age, sex, and blood pressure, and that age and sex were
recorded for every subject. Suppose that blood pressure was not measured
for a fraction of 0.10 of the subjects, and the most common reason for not
obtaining a blood pressure was that the subject was about to die. Deletion
of these very sick patients will cause a major bias (downward) in the model’s
intercept parameter. In general, casewise deletion will bias the estimate of
the model’s intercept parameter (as well as others) when the probability of
a case being incomplete is related to Y and not just to X [422, Example
3.3]. van der Heijden et al. [628] discuss how complete case analysis (casewise
deletion) usually assumes MCAR.
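
The downward bias in the blood pressure example can be made concrete with a simulation. The mortality of 0.20 and the two missingness probabilities below are assumptions chosen so that about 0.10 of blood pressures are missing overall, mostly among patients who die; none of these values come from the text. With no predictors in the model, the intercept is simply the overall death proportion, so comparing that proportion before and after casewise deletion shows the bias directly.

```python
import random

random.seed(3)
n = 20000
records = []
for _ in range(n):
    died = random.random() < 0.20                 # hypothetical mortality
    # Blood pressure is far more often unmeasured for patients about to die.
    p_bp_missing = 0.40 if died else 0.025
    bp_missing = random.random() < p_bp_missing
    records.append((died, bp_missing))

frac_missing = sum(m for _, m in records) / n
full_rate = sum(d for d, _ in records) / n
complete = [d for d, m in records if not m]       # casewise deletion
complete_rate = sum(complete) / len(complete)

print(f"fraction with missing blood pressure: {frac_missing:.3f}")
print(f"death proportion, all subjects:       {full_rate:.3f}")
print(f"death proportion, complete cases:     {complete_rate:.3f}")
```

Deleting the incomplete cases removes a disproportionate share of the deaths, so the complete-case death proportion (and hence the intercept on any scale) is too low.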

Now consider an example in which casewise deletion of incomplete records
is inefficient. The inefficiency comes from the reduction of sample size, which
causes standard errors to increase [162], confidence intervals to widen, and power
of tests of association and tests of lack of fit to decrease. Suppose that the
response is the presence of coronary artery disease and the predictors are
age, sex, LDL cholesterol, HDL cholesterol, blood pressure, triglyceride, and
smoking status. Suppose that age, sex, and smoking are recorded for all
subjects, but that LDL is missing in 0.18 of the subjects, HDL is missing in 0.20,
and triglyceride is missing in 0.21. Assume that all missing data are MCAR
and that all of the subjects missing LDL are also missing HDL and that
overall 0.28 of the subjects have one or more predictors missing and hence
would be excluded from the analysis. If total cholesterol were known on every
subject, even though it does not appear in the model, it (along perhaps with
age and sex) can be used to estimate (impute) LDL and HDL cholesterol and
triglyceride, perhaps using regression equations from other studies. Doing the
analysis on a “filled in” dataset will result in more precise estimates because
the sample size would then include the other 0.28 of the subjects.
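
A rough sense of the cost of excluding 0.28 of the subjects follows from the usual rule that standard errors scale with the inverse square root of the sample size. This is only an approximation, and it assumes MCAR so that the complete cases are a random subsample of $n$ subjects:

```latex
\[
\frac{\mathrm{SE}_{\text{complete cases}}}{\mathrm{SE}_{\text{all subjects}}}
  \approx \sqrt{\frac{n}{(1 - 0.28)\,n}}
  = \frac{1}{\sqrt{0.72}}
  \approx 1.18
\]
```

Under these assumptions, standard errors and confidence interval widths would be about 18% larger after casewise deletion than they would be if all subjects could be used.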

In general, observations should only be discarded if the MCAR assumption
is justified, there is a rarely missing predictor of overriding importance
that cannot be reliably imputed from other information, or if the fraction of
observations excluded is very small and the original sample size is large. Even
then, there is no advantage of such deletion other than saving analyst time.
If a predictor is MAR but its missingness depends on Y, casewise deletion is
biased.

The first blood pressure example points out why it can be dangerous to
handle missing values by adding a dummy variable to the model. Many
analysts would set missing blood pressures to a constant (it doesn’t matter which
constant) and add a variable to the model such as is.na(blood.pressure) in
R notation. The coefficient for the latter dummy variable will be quite large
in the earlier example, and the model will appear to have great ability to
predict death. This is because some of the left-hand side of the model
contaminates the right-hand side; that is, is.na(blood.pressure) is correlated
with death. For categorical variables, another common practice is to add a
new category to denote missing, adding one more degree of freedom to the
predictor and changing its meaning (see footnote a). Jones [326], Allison [12, pp. 9-11],
Donders et al. [161], Knol et al., and van der Heijden et al. [628] describe why both
of these missing-indicator methods are invalid even when MCAR holds.


3.5  Strategies  for  Developing  an  Imputation  Model 

Except  in  special  circumstances  that  usually  involve  only  very  simple  models, 
the  primary  alternative  to  deleting  incomplete  observations  is  imputation  of 
the  missing  values.  Many  non-statisticians  find  the  notion  of  estimating  data 
distasteful, but the way to think about imputation of missing values is that
“making up” data is better than discarding valuable data. It is especially
distressing to have to delete subjects who are missing on an adjustment variable
when  a  major  variable  of  interest  is  not  missing.  So  one  goal  of  imputation 
is  to  use  as  much  information  as  possible  for  examining  any  one  predictor’s 
adjusted  association  with  Y.  The  overall  goal  of  imputation  is  to  preserve  the 
information  and  meaning  of  the  non-missing  data. 

At  this  point  the  analyst  must  make  some  decisions  about  the  information 
to  use  in  computing  predicted  values  for  missing  values. 

1. Imputation of missing values for one of the variables can ignore all other
information. Missing values can be filled in by sampling non-missing values
of the variable, or by using a constant such as the median or mean
non-missing value.

2. Imputation algorithms can be based only on external information not
otherwise used in the model for Y in addition to variables included in later
modeling. For example, family income can be imputed on the basis of
location of residence when such information is to remain confidential for other
aspects of the analysis or when such information would require too many
degrees of freedom to be spent in the ultimate response model.

3. Imputations can be derived by only analyzing interrelationships among
the Xs.

4. Imputations can use relationships among the Xs and between X and Y.

5. Imputations can use X, Y, and auxiliary variables not in the model
predicting Y.

6.  Imputations  can  take  into  account  the  reason  for  non-response  if  known. 

a  This may work if values are “missing” because of “not applicable”, e.g. one has a
measure of marital happiness, dichotomized as high or low, but the sample contains
some unmarried people. One could have a 3-category variable with values high, low,
and unmarried (Paul Allison, IMPUTE e-mail list, 4Jul09).

The model to estimate the missing values in a sometimes-missing (target)
variable should include all variables that are either

1. related to the missing data mechanism;

2.  have  distributions  that  differ  between  subjects  that  have  the  target  variable 
missing  and  those  that  have  it  measured; 

3.  are  associated  with  the  target  variable  when  it  is  not  missing;  or 

4.  are  included  in  the  final  response  model  3 . 

The  imputation  and  analysis  (response)  models  should  be  “congenial”  or  the 
imputation  model  should  be  more  general  than  the  response  model  or  make 
well-founded  assumptions256 . 

When a variable, say Xj, is to be included as a predictor of Y, and Xj is
sometimes missing, ignoring the relationship between Xj and Y for those
observations for which both are known will bias regression coefficients for
Xj toward zero in the outcome model [421]. On the other hand, using Y to
singly impute Xj using a conditional mean will cause a large inflation in the
apparent importance of Xj in the final model. In other words, when the
missing Xj are replaced with a mean that is conditional on Y without a random
component, this will result in a falsely strong relationship between the
imputed Xj values and Y.

At first glance it might seem that using Y to impute one or more of the Xs,
even with allowance for the correct amount of random variation, would result
in a circular analysis in which the importance of the Xs will be exaggerated.
But the relationship between X and Y in the subset of imputed observations
will only be as strong as the associations between X and Y that are evidenced
by the non-missing data. In other words, regression coefficients estimated
from a dataset that is completed by imputation will not in general be biased
high as long as the imputed values have similar variation as non-missing data
values.
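
The three situations just described can be illustrated with a small
simulation. This is only a sketch under stated assumptions: the data are
simulated with one predictor, a true slope of 1, and about half of the X
values missing completely at random, and all variable names are hypothetical.
Ignoring Y is represented here by filling in random draws of observed X
values, which is one simple way of discarding the X-Y relationship.

```r
set.seed(10)
n <- 20000
x <- rnorm(n)
y <- x + rnorm(n)                  # true slope of Y on X is 1
miss <- runif(n) < 0.5             # X missing completely at random
obs  <- !miss

# Slope of Y on a completed version of X
slope <- function(xc) unname(coef(lm(y ~ xc))[2])

# (a) Ignore Y: fill in with random draws of the observed X values
x_a <- x
x_a[miss] <- sample(x[obs], sum(miss), replace = TRUE)

# (b) Use Y, conditional mean only (no random component)
fit  <- lm(x ~ y, subset = obs)
pred <- predict(fit, newdata = data.frame(y = y[miss]))
x_b <- x
x_b[miss] <- pred

# (c) Use Y and add a Gaussian residual with the estimated residual SD
x_c <- x
x_c[miss] <- pred + rnorm(sum(miss), sd = summary(fit)$sigma)

round(c(ignore_Y = slope(x_a), mean_given_Y = slope(x_b),
        draw_given_Y = slope(x_c)), 2)
```

In a setup like this one would expect the first slope to be attenuated, the
second to be inflated, and the third to be close to the true value, which is
the pattern the text describes.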

The next important decision about developing imputation algorithms is the
choice of how missing values are estimated.

1. Missings can be estimated using single “best guesses” (e.g., predicted
conditional expected values or means) based on relationships between
non-missing values. This is called single imputation of conditional means.

2. Missing Xj (or Y) can be estimated using single individual predicted
values, where by predicted value we mean a random variable value from the
whole conditional distribution of Xj. If one uses ordinary multiple
regression to estimate Xj from Y and the other Xs, a random residual would be
added to the predicted mean value. If assuming a normal distribution for Xj
conditional on the other data, such a residual could be computed by a
Gaussian random number generator given an estimate of the residual standard
deviation. If normality is not assumed, the residual could be a randomly
chosen residual from the actual computed residuals. When m missing values
need imputation for Xj, the residuals could be sampled with replacement from
the entire vector of residuals as in the bootstrap. Better still according to
Rubin and Schenker [535] would be to use the “approximate Bayesian bootstrap”
which involves sampling n residuals with replacement from the original n
estimated residuals (from observations not missing on Xj), then sampling m
residuals with replacement from the first sampled set.

3. More than one random predicted value (as just defined) can be generated
for each missing value. This process is called multiple imputation and it has
many advantages over the other methods in general. This is discussed in
Section 3.8.

4. Matching methods can be used to obtain random draws of other subjects’
values to replace missing values. Nearest neighbor matching can be used to
select a subject that is “close” to the subject in need of imputation, on the
basis of a series of variables. This method requires the analyst to make
decisions about what constitutes “closeness.” To simplify the matching
process into a single dimension, Little [420] proposed the predictive mean
matching method where matching is done on the basis of predicted values from
a regression model for predicting the sometimes-missing variable (Section
3.7). According to Little, in large samples predictive mean matching may be
more robust to model misspecification than the method of adding a random
residual to the subject’s predicted value, but because of difficulties in
finding matches the random residual method may be better in smaller samples.
The random residual method may be easier to use when multiple imputations are
needed, but care must be taken to create the correct degree of uncertainty in
residuals.
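
The residual-based choices in the list above (conditional mean, Gaussian
residual, bootstrap residual, and approximate Bayesian bootstrap) can be
sketched in base R as follows. The data are simulated and the variable names
are hypothetical; a single predictor z stands in for "Y and the other Xs",
and ordinary least squares is assumed as the imputation model.

```r
set.seed(20)
n_all <- 250
z <- rnorm(n_all)
x <- 3 + 1.5 * z + rexp(n_all) - 1   # skewed (non-normal) residuals
x[sample(n_all, 50)] <- NA
miss <- is.na(x)
m    <- sum(miss)                    # number of values needing imputation

fit  <- lm(x ~ z)                    # fitted on rows with observed x only
res  <- residuals(fit)               # the n estimated residuals
n    <- length(res)
pred <- predict(fit, newdata = data.frame(z = z[miss]))

# 1. Single conditional mean imputation ("best guess")
imp_mean  <- pred

# 2a. Gaussian residual using the estimated residual standard deviation
imp_gauss <- pred + rnorm(m, sd = summary(fit)$sigma)

# 2b. Residuals sampled with replacement from all residuals (bootstrap)
imp_boot  <- pred + sample(res, m, replace = TRUE)

# 2c. Approximate Bayesian bootstrap: sample n residuals with replacement,
#     then sample the m needed residuals from that first sampled set
res_star  <- sample(res, n, replace = TRUE)
imp_abb   <- pred + sample(res_star, m, replace = TRUE)

# Spread of the imputed values under each choice
sapply(list(mean = imp_mean, gauss = imp_gauss, boot = imp_boot,
            abb = imp_abb), sd)
```

Repeating steps 2a to 2c several times, each time producing a new completed
dataset, is the starting point for multiple imputation.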

What if Xj needs to be imputed for some subjects based on other variables
that themselves may be missing on the same subjects missing on Xj? This is a
place where recursive partitioning with “surrogate splits” in case of missing
predictors may be a good method for developing imputations (see Section 2.5
and p. 142). If using regression to estimate missing values, an algorithm to
cycle through all sometimes-missing variables for multiple iterations may
perform well. This algorithm is used by the R transcan function described in
Section 4.7.4 as well as the to-be-described aregImpute function. First, all
missing values are initialized to medians (modes for categorical variables).
Then every time missing values are estimated for a certain variable, those
estimates are inserted the next time the variable is used to predict other
sometimes-missing variables.
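
The cycling idea can be sketched in a few lines of base R. This is not the
transcan or aregImpute implementation: the helper name cycle_impute is
hypothetical, only numeric variables are handled, ordinary linear regression
is assumed for every target, and conditional means are inserted without any
random component.

```r
# Hypothetical helper: cycle through sometimes-missing numeric variables
cycle_impute <- function(d, n_iter = 5) {
  miss <- lapply(d, is.na)               # remember original NA positions
  # Step 1: initialize every missing value to the variable's median
  for (v in names(d)) d[[v]][miss[[v]]] <- median(d[[v]], na.rm = TRUE)
  targets <- names(d)[vapply(miss, any, logical(1))]
  for (iter in seq_len(n_iter)) {
    for (v in targets) {
      # Step 2: fit on rows where v was actually observed; the predictors
      # use the current fill-ins for their own missing values
      fit <- lm(reformulate(setdiff(names(d), v), response = v),
                data = d[!miss[[v]], , drop = FALSE])
      # Step 3: insert updated estimates so that later models can use them
      d[[v]][miss[[v]]] <- predict(fit,
                                   newdata = d[miss[[v]], , drop = FALSE])
    }
  }
  d
}

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
x3 <- x2 - x1 + rnorm(n)
d  <- data.frame(x1, x2, x3)
d$x1[sample(n, 30)] <- NA
d$x2[sample(n, 30)] <- NA
completed <- cycle_impute(d)
```

Observed values are never altered; only the positions that were originally
missing are updated on each pass.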

If you want to assess the importance of a specific predictor that is
frequently missing, it is a good idea to perform a sensitivity analysis in
which all observations containing imputed values for that predictor are
temporarily deleted. The test based on a model that included the imputed
values may be diluted by the imputation or it may test the wrong hypothesis,
especially if Y is not used in imputing X.

Little argues for down-weighting observations containing imputations, to
obtain a more accurate variance-covariance matrix. For the ordinary linear
model, the weights have been worked out for some cases [421, p. 1231].


3.6  Single Conditional Mean Imputation

For a continuous or binary X that is unrelated to all other predictor
variables, the mean or median may be substituted for missing values without
much loss of efficiency [162], although regression coefficients will be
biased low since Y was not utilized in the imputation. When the variable of
interest is related to the other Xs, it is far more efficient to use an
individual predictive model for each X based on the other variables
[79,525,612]. The “best guess” imputation method fills in missings with
predicted expected values using the multivariable imputation model based on
non-missing data (footnote b). It is true that conditional means are the best
estimates of unknown values, but except perhaps for binary logistic
regression [21,623] their use will result in biased estimates and very biased
(low) variance estimates. The latter problem arises from the reduced
variability of imputed values [174, p. 464].

Tree-based models (Section 2.5) may be very useful for imputation since they
do not require linearity or additivity assumptions, although such models
often have poor discrimination when they don’t overfit. When a continuous X
being imputed needs to be non-monotonically transformed to best relate it to
the other Xs (e.g., blood pressure vs. heart rate), trees and ordinary
regression are inadequate. Here a general transformation modeling procedure
(Section 4.7) may be needed.

Schemper et al. [551,553] proposed imputing missing binary covariables by
predicted probabilities. For categorical sometimes-missing variables,
imputation models can be derived using polytomous logistic regression or a
classification tree method. For missing values, the most likely value for
each subject (from the series of predicted probabilities from the logistic or
recursive partitioning model) can be substituted to avoid creating a new
category that is falsely highly correlated with Y. For an ordinal X, the
predicted mean value (possibly rounded to the nearest actual data value) or
median value from an ordinal logistic model is sometimes useful.


3.7  Predictive  Mean  Matching 

In predictive mean matching [42] (PMM), one replaces a missing (NA) value for
the target variable being imputed with the actual value from a donor
observation. Donors are identified by matching in only one dimension, namely
the predicted value (e.g., predicted mean) of the target. Key considerations
are how to

1. model the target when it is not NA

2. match donors on predicted values

3. avoid overuse of “good” donors to disallow excessive ties in imputed data

4. account for all uncertainties (Section 3.8).

The predictive model for each target variable uses any outcome variables, all
predictors in the final outcome model, plus any needed auxiliary variables.
The modeling method should be flexible, not assuming linearity. Many methods
will suffice; parametric additive models are often good choices. Beauties of
PMM include the lack of need for distributional assumptions (as no residuals
are calculated), and predicted values need only be monotonically related to
real predicted values (footnote c).

b Predictors of the target variable include all the other Xs along with
auxiliary variables that are not included in the final outcome model, as long
as they precede the variable being imputed in the causal chain (unlike with
multiple imputation).

In the original PMM method the donor for an NA was the complete observation
whose predicted target was closest to the predicted value of the target from
all complete observations (footnote d). This approach can result in some
donors being used repeatedly. This can be addressed by sampling from a
multinomial distribution, where the probabilities are scaled distances of all
potential donors’ predictions to the predicted value $y^*$ of the missing
target. Tukey’s tricube function (used in loess) is a good weighting
function, implemented in the Hmisc aregImpute function:

$$
w_i = \left(1 - \min(d_i/s,\, 1)^3\right)^3, \qquad
d_i = |\hat{y}_i - y^*|, \qquad
s = 0.2 \times \operatorname{mean}_i |\hat{y}_i - y^*|  \tag{3.1}
$$

s above is a good default scale factor, and the $w_i$ are scaled so that
$\sum_i w_i = 1$.
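
Weighted donor selection can be sketched in base R as follows. This is an
illustration, not the aregImpute code. It assumes the tricube form
w = (1 - min(d/s, 1)^3)^3 with s equal to 0.2 times the mean distance; the
helper name tricube_weights, its frac argument, the fallback to the single
closest donor when every weight is zero, and the simulated data are all
hypothetical.

```r
# Tricube weights over potential donors for one missing target value
tricube_weights <- function(yhat_donors, ystar, frac = 0.2) {
  d <- abs(yhat_donors - ystar)        # distance of each donor's prediction
  s <- frac * mean(d)                  # scale factor
  w <- (1 - pmin(d / s, 1)^3)^3        # zero weight at distance s or beyond
  if (sum(w) == 0) w[which.min(d)] <- 1  # fallback: closest donor only
  w / sum(w)                           # scale so the weights sum to 1
}

set.seed(2)
n <- 300
z <- rnorm(n)
x <- 2 + z + rnorm(n, sd = 0.5)
x[sample(n, 60)] <- NA
obs  <- !is.na(x)

fit  <- lm(x ~ z, subset = obs)        # model for the target when not NA
yhat <- predict(fit, newdata = data.frame(z = z))

x_imp <- x
for (i in which(!obs)) {
  w     <- tricube_weights(yhat[obs], yhat[i])
  donor <- sample(which(obs), 1, prob = w)   # multinomial draw of one donor
  x_imp[i] <- x[donor]                 # impute the donor's actual value
}
```

Because imputed values are always actual observed values, no residuals or
distributional assumptions are involved.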


3.8  Multiple  Imputation 

Imputing missing values and then doing an ordinary analysis as if the imputed
values were real measurements is usually better than excluding subjects with
incomplete data. However, ordinary formulas for standard errors and other
statistics are invalid unless imputation is taken into account [651]. Methods
for properly accounting for having incomplete data can be complex. The
bootstrap (described later) is an easy method to implement, but the
computations can be slow (footnote e).


c Thus when modeling binary or categorical targets one can frequently take
least squares shortcuts in place of maximum likelihood for binary, ordinal,
or multinomial logistic models.

d [662] discusses an alternative method based on choosing a donor observation
at random from the q closest matches (q = 3, for example).

e To use the bootstrap to correctly estimate variances of regression
coefficients, one must repeat the imputation process and the model fitting
perhaps 100 times using a resampling procedure [174, 566] (see Section 5.2).
Still, the bootstrap can estimate the right variance for the wrong parameter
estimates if the imputations are not done correctly.

Multiple imputation uses random draws from the conditional distribution of
the target variable given the other variables (and any additional information
that is relevant) [85,417,421,536]. The additional information used to
predict the missing values can contain any variables that are potentially
predictive, including variables measured in the future; the causal chain is
not relevant [421,463]. When a regression model is used for imputation, the
process involves adding a random residual to the “best guess” for missing
values, to yield the same conditional variance as the original variable.
Methods for estimating residuals were listed in Section 3.5. To properly
account for variability due to unknown values, the imputation is repeated M
times, where M > 3. Each repetition results in a “completed” dataset that is
analyzed using the standard method. Parameter estimates are averaged over
these multiple imputations to obtain better estimates than those from single
imputation. The variance-covariance matrix of the averaged parameter
estimates, adjusted for variability due to imputation, is estimated using
[422]

$$
V = M^{-1} \sum_{i=1}^{M} V_i + \frac{M+1}{M} B  \tag{3.2}
$$

where $V_i$ is the ordinary complete data estimate of the variance-covariance
matrix for the model parameters from the ith imputation, and B is the
between-imputation sample variance-covariance matrix, the diagonal entries of
which are the ordinary sample variances of the M parameter estimates.

After running aregImpute (or MICE) you can run the Hmisc package’s
fit.mult.impute function to fit the chosen model separately for each
artificially completed dataset corresponding to each imputation. After
fit.mult.impute fits all of the models, it averages the sets of regression
coefficients and computes variance and covariance estimates that are adjusted
for imputation (using Eq. 3.2).

White and Royston 1 provide a method for multiply imputing missing covariate
values using censored survival time data in the context of the Cox
proportional hazards model.

White et al. [662] recommend choosing the number of imputations M so that the
key inferential statistics are very reproducible should the imputation
analysis be repeated. They suggest the use of 100f imputations when f is the
fraction of cases that are incomplete. See also [85, Section 2.7] and [232].
An extreme amount of missing data does not prevent one from using multiple
imputation, because alternatives are worse [321]. Horton and Lipsitz’ 02 also
have a good overview of multiple imputation and a review of several software
packages that implement PMM.

Caution: Multiple imputation methods can generate imputations having very
reasonable distributions but still not having the property that final
response model regression coefficients have nominal confidence interval
coverage. Among other things, it is worth checking that imputations generate
the correct collinearities among covariates.
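
The combination of estimates across imputations (averaging the coefficients,
then adding the average within-imputation covariance matrix to (M + 1)/M
times the between-imputation covariance matrix) can be sketched as below.
This is an illustration of the arithmetic only, not the fit.mult.impute code;
the function name pool_estimates and the toy numbers are hypothetical.

```r
# betas: M x p matrix, one row of coefficient estimates per imputation
# vcovs: list of M (p x p) complete-data covariance matrices
pool_estimates <- function(betas, vcovs) {
  M    <- nrow(betas)
  bbar <- colMeans(betas)              # averaged parameter estimates
  W    <- Reduce(`+`, vcovs) / M       # mean within-imputation covariance
  B    <- cov(betas)                   # between-imputation covariance
  list(coef = bbar, vcov = W + (M + 1) / M * B)
}

# Toy input: M = 3 imputations, p = 2 coefficients
betas <- rbind(c(1.0, 2.0), c(1.2, 1.8), c(0.8, 2.2))
vcovs <- replicate(3, diag(c(0.04, 0.09)), simplify = FALSE)
pooled <- pool_estimates(betas, vcovs)
pooled$coef
sqrt(diag(pooled$vcov))                # imputation-adjusted standard errors
```

The adjusted standard errors are larger than the complete-data ones whenever
the estimates vary across imputations, which is how the extra uncertainty due
to missing values enters the analysis.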


3.8.1  The aregImpute and Other Chained Equations Approaches

A flexible approach to multiple imputation that handles a wide variety of
target variables to be imputed and allows for multiple variables to be
missing on the same subject is the chained equation method. With a chained
equations approach, each target variable is predicted by a regression model
conditional on all other variables in the model, plus other variables. An
iterative process cycles through all target variables to impute all missing
values [627]. This approach is used in the MICE algorithm (multiple
imputation using chained equations) implemented in R and other systems. The
chained equation method does not attempt to use the full Bayesian
multivariate model for all target variables, which makes it more flexible and
easy to use but leaves it open to creating improper imputations, e.g.,
imputing conflicting values for different target variables. However,
simulation studies [627] so far have demonstrated very good performance of
imputation based on chained equations in non-complex situations.

The aregImpute algorithm 63 takes all aspects of uncertainty into account
using the bootstrap while using the same estimation procedures as transcan
(Section 4.7). Different bootstrap resamples are used for each imputation by
fitting a flexible additive model on a sample with replacement from the
original data. This model is used to predict all of the original missing and
non-missing values for the target variable for the current imputation.
aregImpute uses flexible parametric additive regression spline models to
predict target variables. There is an option to allow target variables to be
optimally transformed, even non-monotonically (but this can overfit). The
function implements regression imputation based on adding random residuals to
predicted means, but its real value lies in implementing a wide variety of
PMM algorithms.

The default method used by aregImpute is (weighted) PMM so that no residuals
or distributional assumptions are required. The default PMM matching used is
van Buuren’s “Type 1” matching [85, Section 3.4.2] to capture the right
amount of uncertainty. Here one computes predicted values for missing values
using a regression fit on the bootstrap sample, and finds donor observations
by matching those predictions to predictions from potential donors using the
regression fit from the original sample of complete observations. When a
predictor of the target variable is missing, it is first imputed from its
last imputation when it was a target variable. The first 3 iterations of the
process are ignored (“burn-in”). aregImpute seems to perform as well as MICE
but runs significantly faster and allows for nonlinear relationships.

Here is an example using the R Hmisc and rms packages.

    library(Hmisc)   # aregImpute, fit.mult.impute
    library(rms)     # rcs, lrm
    # Multiply impute all sometimes-missing variables (5 imputations)
    a <- aregImpute(~ age + sex + bp + death +
                      heart.attack.before.death,
                    data = mydata, n.impute = 5)
    # Fit the outcome model on each completed dataset and combine the fits
    f <- fit.mult.impute(death ~ rcs(age, 3) + sex +
                           rcs(bp, 5), lrm, a, data = mydata)

Table 3.1  Summary of Methods for Dealing with Missing Values

Method                                  Deletion      Single   Multiple
Allows nonrandom missing                -             x        x
Reduces sample size                     x             -        -
Apparent S.E. of $\hat{\beta}$ too low  -             x
Increases real S.E. of $\hat{\beta}$    x             -        -
$\hat{\beta}$ biased                    if not MCAR   x


3.9  Diagnostics

One diagnostic that can be helpful in assessing the MCAR assumption is to
compare the distribution of non-missing Y for those subjects having complete
X with those having incomplete X. On the other hand, Yucel and Zaslavsky
[681] developed a diagnostic that is useful for checking the imputations
themselves. In solving a problem related to imputing binary variables using
continuous data models, they proposed a simple approach. Suppose we were
interested in the reasonableness of imputed values for a sometimes-missing
predictor Xj. Duplicate the entire dataset, but in the duplicated
observations set all values of Xj to missing. Develop imputed values for the
missing values of Xj, and in the observations of the duplicated portion of
the dataset corresponding to originally non-missing values of Xj, compare the
distribution of imputed Xj with the original values of Xj.
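
The duplication diagnostic can be sketched in base R as follows. The imputer
used here is only a stand-in (regression prediction plus a resampled
residual), not aregImpute, and the function name impute_x and the simulated
data are hypothetical; any imputation procedure could be substituted for it.

```r
set.seed(3)
n <- 500
z <- rnorm(n)
x <- 1 + 0.8 * z + rnorm(n, sd = 0.6)
x[sample(n, 100)] <- NA
d <- data.frame(x = x, z = z)

# Stand-in imputer (an assumption): predicted mean plus a residual sampled
# with replacement from the observed residuals
impute_x <- function(dat) {
  fit  <- lm(x ~ z, data = dat)        # uses rows with observed x only
  miss <- is.na(dat$x)
  dat$x[miss] <- predict(fit, newdata = dat[miss, ]) +
    sample(residuals(fit), sum(miss), replace = TRUE)
  dat
}

dup      <- d
dup$x    <- NA_real_                   # duplicated portion: x all missing
stacked  <- impute_x(rbind(d, dup))    # impute original and duplicate rows
dup_rows <- n + which(!is.na(d$x))     # duplicates of originally observed x
imputed  <- stacked$x[dup_rows]
original <- d$x[!is.na(d$x)]

# Compare the distribution of imputed values with the original values
rbind(original = quantile(original), imputed = quantile(imputed))
```

A quantile-quantile plot of the two vectors would serve the same purpose
graphically; large discrepancies suggest that the imputation model is
inadequate.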


3.10  Summary  and  Rough  Guidelines 

Table 3.1 summarizes the advantages and disadvantages of three methods of
dealing with missing data. Here “Single” refers to single conditional mean
imputation (which cannot utilize Y) and “Multiple” refers to multiple
random-draw imputation (which can incorporate Y).

The following contains crude guidelines. Simulation studies are needed to
refine the recommendations. Here f refers to the proportion of observations
having any variables missing.

f < 0.03: It doesn’t matter very much how you impute missings or whether you
adjust variance of regression coefficient estimates for having imputed data
in this case. For continuous variables imputing missings with the median
non-missing value is adequate; for categorical predictors the most frequent
category can be used. Complete case analysis is also an option here. Multiple
imputation may be needed to check that the simple approach “worked.”

f ≥ 0.03: Use multiple imputation with number of imputations equal to
max(5, 100f). Fewer imputations may be possible with very large sample sizes.
Type 1 predictive mean matching is usually preferred, with weighted selection
of donors. Account for imputation in estimating the covariance matrix for
final parameter estimates. Use the t distribution instead of the Gaussian
distribution for tests and confidence intervals, if possible, using the
estimated d.f. for the parameter estimates.

Multiple predictors frequently missing: More imputations may be required.
Perform a “sensitivity to order” analysis by creating multiple imputations
using different orderings of sometimes missing variables. It may be
beneficial to place the variable with the highest number of NAs first so that
initialization of other missing variables to medians will have less impact.
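
As a small worked example of the rule for the number of imputations, the
following computes f and max(5, 100f) for a toy data frame. The data are made
up, and rounding 100f up to a whole number is an assumption of this sketch.

```r
d <- data.frame(
  age = c(50, NA, 61, 47, 72, 58, NA, 66, 70, 53),
  bp  = c(120, 135, NA, 118, 140, 125, 130, 128, 150, 122)
)

n_incomplete <- sum(!complete.cases(d))   # rows with any variable missing
f <- n_incomplete / nrow(d)               # proportion of incomplete rows
M <- max(5, ceiling(100 * n_incomplete / nrow(d)))

c(f = f, M = M)
```

Here 3 of the 10 rows are incomplete, so f = 0.3 and the rule gives 30
imputations; with f below 0.05 the floor of 5 imputations would apply.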

It  is  important  to  note  that  the  reasons  for  missing  data  are  more  important 
determinants  of  how  missing  values  should  be  handled  than  is  the  quantity 
of  missing  values. 

If the main interest is prediction and not interpretation or inference about
individual effects, it is worth trying a simple imputation (e.g., median or
normal value substitution) to see if the resulting model predicts the
response almost as well as one developed after using customized imputation.
But it is not appropriate to use the dummy variable or extra category method,
because these methods steal information from Y and bias all $\hat{\beta}$s.
Clark and Altman 0 presented a nice example of the use of multiple imputation
for developing a prognostic model. Marshall et al. [442] developed a useful
method for obtaining predictions on future observations when some of the
needed predictors are unavailable. Their method uses an approximate re-fit of
the original model for available predictors only, utilizing only the
coefficient estimates and covariance matrix from the original fit. Little and
An [418] also have an excellent review of imputation methods and developed
several approximate formulas for understanding properties of various
estimators. They also developed a method combining imputation of missing
values with propensity score modeling of the probability of missingness.


3.11  Further Reading

These types of missing data are well described in an excellent review article
on missing data by Schafer and Graham [542]. A good introductory article on
missing data and imputation is by Donders et al. [161] and a good overview of
multiple imputation is by White et al. [662] and Harel and Zhou [256]. Paul
Allison’s booklet [12] and van Buuren’s book [85] are also excellent
practical treatments. Crawford et al. [138] give an example where responses
are not MCAR for which deleting subjects with missing responses resulted in a
biased estimate of the response distribution. They found that multiple
imputation of the response resulted in much improved estimates. Wood et al.
[673] have a good review of how missing response data are typically handled
in randomized trial reports, with recommendations for improvements. Barnes et
al. [42] have a good overview of imputation methods and a comparison of bias
and confidence interval coverage for the methods when applied to longitudinal
data with a small number of subjects. Twist et al. [617] found instability in
using multiple imputation of longitudinal data, and advantages of using
instead full likelihood models.

See van Buuren et al. [626] for an example in which subjects having missing
baseline blood pressure had shorter survival time. Joseph et al. [327]
provide examples demonstrating difficulties with casewise deletion and single
imputation, and comment on the robustness of multiple imputation methods to
violations of assumptions.

Another problem with the missingness indicator approach arises when more than
one predictor is missing and these predictors are missing on almost the same
subjects. The missingness indicator variables will be collinear, that is,
impossible to disentangle [326].

See [623, pp. 2645-2646] for several problems with the “missing category”
approach. A clear example is in [161] where covariates $X_1$, $X_2$ have true
$\beta_1 = 1$, $\beta_2 = 0$ and $X_1$ is MCAR. Adding a missingness
indicator for $X_1$ as a covariate resulted in $\hat{\beta}_1 = 0.55$,
$\hat{\beta}_2 = 0.51$ because in the missing observations the constant $X_1$
was uncorrelated with $X_2$. D’Agostino and Rubin [146] developed methods for
propensity score modeling that allow for missing data. They mentioned that
extra categories may be added to allow for missing data in propensity models
and that adding indicator variables describing patterns of missingness will
also allow the analyst to match on missingness patterns when comparing
non-randomly assigned treatments.

Harel and Zhou [256] and Siddique [569] discuss the approximate Bayesian
bootstrap further.

Kalton and Kasprzyk [332] proposed a hybrid approach to imputation in which
missing values are imputed with the predicted value for the subject plus the
residual from the subject having the closest predicted value to the subject
being imputed.

Miller et al. [458] studied the effect of ignoring imputation when
conditional mean fill-in methods are used, and showed how to formalize such
methods using linear models.

Meng [455] argues against always separating imputation from final analysis,
and in favor of sometimes incorporating weights into the process.

van Buuren et al. [626] presented an excellent case study in multiple
imputation in the context of survival analysis. Barzi and Woodward [43]
present a nice review of multiple imputation with detailed comparison of
results (point estimates and confidence limits for the effect of the
sometimes-missing predictor) for various imputation methods. Barnard and
Rubin [41] derived an estimate of the d.f. associated with the
imputation-adjusted variance matrix for use in a t-distribution approximation
for hypothesis tests about imputation-averaged coefficient estimates. When
d.f. is not very large, the t approximation will result in more accurate
P-values than using a normal approximation that we use with Wald statistics
after inserting Equation 3.2 as the variance matrix.

Little and An [418] present imputation methods based on flexible additive
regression models using penalized cubic splines. Horton and Kleinman 01
compare several software packages for handling missing data and have
comparisons of results with that of aregImpute. Moons et al. [463] compared
aregImpute with MICE.

He and Zaslavsky [280] formalized the duplication approach to imputation
diagnostics.

A good general reference on missing data is Little and Rubin [422], and
Volume 16, Nos. 1 to 3 of Statistics in Medicine, a large issue devoted to
incomplete covariable data. Vach [620] is an excellent text describing
properties of various methods of dealing with missing data in binary logistic
regression (see also [621,622,624]). These references show how to use maximum
likelihood to explicitly model the missing data process. Little and Rubin
show how imputation can be avoided if the analyst is willing to assume a
multivariate distribution for the joint distribution of X and Y. Since X
usually contains a strange mixture of binary, polytomous, and continuous but
highly skewed predictors, it is unlikely that this approach will work
optimally in many problems. That’s the reason the imputation approach is
emphasized. See Rubin [536] for a comprehensive source on multiple
imputation. See Little [419], Vach and Blettner [623], Rubin and Schenker
[535], Zhou et al. [688], Greenland and Finkle [242], and Hunsberger et al.
[313] for excellent reviews of missing data problems and approaches to
solving them. Reilly and Pepe have a nice comparison of the “hot-deck”
imputation method with a maximum likelihood-based method [523]. White and
Carlin [660] studied bias of multiple imputation vs. complete case analysis.


3.12  Problems 

The  SUPPORT  Study  (Study  to  Understand  Prognoses  Preferences  Out¬ 
comes  and  Risks  of  Treatments)  was  a  five-hospital  study  of  10,000  critically 
ill  hospitalized  adultsf352.  Patients  were  followed  for  in-hospital  outcomes  and 
for  long-term  survival.  We  analyze  35  variables  and  a  random  sample  of  1000 
patients  from  the  study. 

1.  Explore  the  variables  and  patterns  of  missing  data  in  the  SUPPORT 
dataset. 

a. Print univariable summaries of all variables. Make a plot (showing all
variables on one page) that describes especially the continuous variables.

b.  Make  a  plot  showing  the  extent  of  missing  data  and  tendencies  for  some 
variables  to  be  missing  on  the  same  patients.  Functions  in  the  Hmisc 
package  may  be  useful. 


c. Total hospital costs (variable totcst) were estimated from
hospital-specific Medicare cost-to-charge ratios. Characterize what kind of
patients have missing totcst. For this characterization use the following
patient descriptors: age, sex, dzgroup, num.co, edu, income, scoma,
meanbp, hrt, resp, temp.

f The dataset is on the book’s dataset wiki and may be automatically fetched over
the internet and loaded using the Hmisc package’s command getHdata(support).

2. Prepare for later development of a model to predict costs by developing
reliable imputations for missing costs. Remove the observation having zero
totcst.^g

a.  The  cost  estimates  are  not  available  on  105  patients.  Total  hospital 
charges  (bills)  are  available  on  all  but  25  patients.  Relate  these  two 
variables  to  each  other  with  an  eye  toward  using  charges  to  predict 
totcst when totcst is missing. Make graphs that will tell whether linear
regression or linear regression after taking logs of both variables is
better.

b.  Impute  missing  total  hospital  costs  in  SUPPORT  based  on  a  regression 
model  relating  charges  to  costs,  when  charges  are  available.  You  may 
want  to  use  a  statement  like  the  following  in  R: 

    support <- transform(support,
                         totcst = ifelse(is.na(totcst),
                                         (expression in charges), totcst))

If in the previous problem you felt that the relationship between costs
and charges should be based on taking logs of both variables, the
“expression in charges” above may look something like
exp(intercept + slope * log(charges)), where constants are inserted for
intercept and slope.

c. Compute the likely error in approximating total cost using charges by
computing the median absolute difference between predicted and observed
total costs in the patients having both variables available. If you
used a log transformation, also compute the median absolute percent
error in imputing total costs by anti-logging the absolute difference in
predicted logs.
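The mechanics of Problem 2 can be sketched on simulated data. This is only an illustration under stated assumptions, not a solution for SUPPORT: the charges and costs below are generated artificially, the log-log form is assumed rather than chosen from graphs, and the object names (imputed, med_abs_diff, med_abs_pct) are hypothetical.

```r
# Simulated stand-in for charges and totcst (NOT the SUPPORT data)
set.seed(4)
n <- 500
charges <- exp(rnorm(n, 10, 1))
totcst  <- exp(-0.5 + 0.95 * log(charges) + rnorm(n, sd = 0.3))
totcst[sample(n, 50)] <- NA                 # make some costs missing

# Log-log regression; rows with missing cost are dropped by lm()
fit <- lm(log(totcst) ~ log(charges))
b   <- unname(coef(fit))
pred_log <- b[1] + b[2] * log(charges)      # predicted log cost for everyone

# Single imputation: fill in anti-logged predictions only where cost is missing
imputed <- ifelse(is.na(totcst), exp(pred_log), totcst)

# Error summaries among patients having both variables
obs <- !is.na(totcst)
med_abs_diff <- median(abs(exp(pred_log[obs]) - totcst[obs]))
# Anti-log the absolute difference in logs to get a ratio, then express as percent
med_abs_pct  <- 100 * (median(exp(abs(pred_log[obs] - log(totcst[obs])))) - 1)

print(c(med_abs_diff = med_abs_diff, med_abs_pct = med_abs_pct))
```

With real data the intercept and slope would come from the fit to the observed cost and charge pairs, and the choice between raw and log scales would rest on the graphs requested in part a.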

3.  State  briefly  why  single  conditional  median  imputation  is  OK  here. 

4.  Use  transcan  to  develop  single  imputations  for  total  cost,  commenting  on 
the  strength  of  the  model  fitted  by  transcan  as  well  as  how  strongly  each 
variable  can  be  predicted  from  all  the  others. 

5. Use predictive mean matching to multiply impute cost 10 times per missing
observation. Describe graphically the distributions of imputed values and
briefly compare these to distributions of non-imputed values. State in a
simple way what the sample variance of multiple imputations for a single
observation of a continuous predictor is approximating.

g You can use the R command subset(support, is.na(totcst) | totcst > 0). The
is.na condition tells R that it is permissible to include observations having missing
totcst without setting all columns of such observations to NA.

h We are anti-logging predicted log costs and we assume log cost has a symmetric
distribution.

6.  Using  the  multiple  imputed  values,  develop  an  overall  least  squares  model 
for  total  cost  (using  the  log  transformation)  making  optimal  use  of  partial 
information,  with  variances  computed  so  as  to  take  imputation  (except  for 
cost)  into  account.  The  model  should  use  the  predictors  in  Problem  1  and 
should  not  assume  linearity  in  any  predictor  but  should  assume  additivity. 
Interpret  one  of  the  resulting  ratios  of  imputation-corrected  variance  to 
apparent  variance  and  explain  why  ratios  greater  than  one  do  not  mean 
that  imputation  is  inefficient. 


Chapter  4 

Multivariable  Modeling  Strategies 


Chapter 2 dealt with aspects of modeling such as transformations of
predictors, relaxing linearity assumptions, modeling interactions, and examining
lack of fit. Chapter 3 dealt with missing data, focusing on utilization of
incomplete predictor information. All of these areas are important in the overall
scheme of model development, and they cannot be separated from what is to
follow. In this chapter we concern ourselves with issues related to the whole
model, with emphasis on deciding on the amount of complexity to allow in
the model and on dealing with large numbers of predictors. The chapter
concludes with three default modeling strategies depending on whether the goal
is prediction, estimation, or hypothesis testing.

There  are  many  choices  to  be  made  when  deciding  upon  a  global  modeling 
strategy,  including  choice  between 

•  parametric  and  nonparametric  procedures 

•  parsimony  and  complexity 

•  parsimony  and  good  discrimination  ability 

•  interpretable  models  and  black  boxes. 

This chapter addresses some of these issues. One general theme of what
follows is the idea that in statistical inference when a method is capable of
worsening performance of an estimator or inferential quantity (i.e., when the
method is not systematically biased in one’s favor), the analyst is allowed to
benefit from the method. Variable selection is an example where the analysis
is systematically tilted in one’s favor by directly selecting variables on the
basis of P-values of interest, and all elements of the final result (including
regression coefficients and P-values) are biased. On the other hand, the next
section is an example of the “capitalize on the benefit when it works, and
the method may hurt” approach because one may reduce the complexity of
an apparently weak predictor by removing its most important component
(nonlinear effects) from how the predictor is expressed in the model. The
method hides tests of nonlinearity that would systematically bias the final
result.

The  book’s  web  site  contains  a  number  of  simulation  studies  and  references 
to  others  that  support  the  advocated  approaches. 


4.1  Prespecification  of  Predictor  Complexity  Without 
Later  Simplification 

There are rare occasions in which one actually expects a relationship to be
linear. For example, one might predict mean arterial blood pressure at two
months after beginning drug administration using as baseline variables the
pretreatment mean blood pressure and other variables. In this case one
expects the pretreatment blood pressure to linearly relate to follow-up blood
pressure, and modeling is simple.^a In the vast majority of studies, however,
there is every reason to suppose that all relationships involving nonbinary
predictors are nonlinear. In these cases, the only reason to represent
predictors linearly in the model is that there is insufficient information in the
sample to allow us to reliably fit nonlinear relationships.^b

Supposing that nonlinearities are entertained, analysts often use scatter
diagrams or descriptive statistics to decide how to represent variables in a
model. The result will often be an adequately fitting model, but confidence
limits will be too narrow, P-values too small, $R^{2}$ too large, and calibration
too good to be true. The reason is that the “phantom d.f.” that represented
potential complexities in the model that were dismissed during the subjective
assessments are forgotten in computing standard errors, P-values, and
$R^{2}_{\mathrm{adj}}$. The same problem is created when one entertains several
transformations (log, $\sqrt{\ }$, etc.) and uses the data to see which one fits
best, or when one tries to simplify a spline fit to a simple transformation.

An approach that solves this problem is to prespecify the complexity with
which each predictor is represented in the model, without later simplification
of the model. The amount of complexity (e.g., number of knots in spline
functions or order of ordinary polynomials) one can afford to fit is roughly related
to the “effective sample size.” It is also very reasonable to allow for greater
complexity for predictors that are thought to be more powerfully related to
Y. For example, errors in estimating the curvature of a regression function are
consequential in predicting Y only when the regression is somewhere steep.
Once the analyst decides to include a predictor in every model, it is fair to
use general measures of association to quantify the predictive potential for
a variable. For example, if a predictor has a low rank correlation with the
response, it will not “pay” to devote many degrees of freedom to that
predictor in a spline function having many knots. On the other hand, a potent
predictor (with a high rank correlation) not known to act linearly might be
assigned five knots if the sample size allows.

a Even then, the two blood pressures may need to be transformed to meet
distributional assumptions.

b Shrinkage (penalized estimation) is a general solution (see Section 4.5). One can
always use complex models that are “penalized towards simplicity,” with the amount
of penalization being greater for smaller sample sizes.

When the effective sample size available is sufficiently large so that a
saturated main effects model may be fitted, a good approach to gauging predictive
potential is the following.

• Let all continuous predictors be represented as restricted cubic splines with
k knots, where k is the maximum number of knots the analyst entertains
for the current problem.

• Let all categorical predictors retain their original categories except for
pooling of very low prevalence categories (e.g., ones containing < 6
observations).

• Fit this general main effects model.

• Compute the partial $\chi^{2}$ statistic for testing the association of each
predictor with the response, adjusted for all other predictors. In the case of
ordinary regression, convert partial F statistics to $\chi^{2}$ statistics or partial
$R^{2}$ values.

• Make corrections for chance associations to “level the playing field” for
predictors having greatly varying d.f., e.g., subtract the d.f. from the partial
$\chi^{2}$ (the expected value of $\chi^{2}_{p}$ is $p$ under $H_{0}$).

•  Make  certain  that  tests  of  nonlinearity  are  not  revealed  as  this  would  bias 
the  analyst. 

•  Sort  the  partial  association  statistics  in  descending  order. 

Commands  in  the  rms  package  can  be  used  to  plot  only  what  is  needed. 
Here  is  an  example  for  a  logistic  model. 

    f <- lrm(y ~ sex + race + rcs(age, 5) + rcs(weight, 5) +
             rcs(height, 5) + rcs(blood.pressure, 5))
    plot(anova(f))
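The ranking steps in the list above can also be sketched with base R alone. The following is a minimal illustration rather than the rms implementation. It assumes simulated data and hypothetical variable names, uses ns() from the splines package as a stand-in for rcs() (4 d.f. matches the d.f. of a 5-knot restricted cubic spline, though knot locations differ), and uses likelihood ratio $\chi^{2}$ statistics from drop1() instead of the statistics reported by anova() on an lrm fit.

```r
library(splines)   # ships with R; ns() gives a natural (restricted) cubic spline basis

set.seed(1)
n <- 400
d <- data.frame(age    = rnorm(n, 50, 10),
                weight = rnorm(n, 75, 12),
                sex    = factor(sample(c("female", "male"), n, replace = TRUE)))

# Assumed truth for the simulation: age acts nonlinearly, sex modestly, weight not at all
lp  <- -1 + 0.08 * (d$age - 50) + 0.004 * (d$age - 50)^2 + 0.5 * (d$sex == "male")
d$y <- rbinom(n, 1, plogis(lp))

# Saturated main effects model: every continuous predictor gets the same complexity
f <- glm(y ~ sex + ns(age, 4) + ns(weight, 4), family = binomial, data = d)

# Partial likelihood ratio chi-square for each predictor, adjusted for the others
tab <- drop1(f, test = "Chisq")[-1, ]          # first row is "<none>"
assoc <- data.frame(term           = rownames(tab),
                    chisq          = tab$LRT,
                    df             = tab$Df,
                    chisq_minus_df = tab$LRT - tab$Df)   # correction for chance association

# Sort in descending order; no test of nonlinearity is computed or shown
assoc <- assoc[order(-assoc$chisq_minus_df), ]
print(assoc)
```

Only the total association of each predictor is displayed, so the sketch respects the requirement that tests of nonlinearity stay hidden from the analyst.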

This approach, and the rank correlation approach about to be discussed,
do not require the analyst to really prespecify predictor complexity, so how
are they not biased in our favor? There are two reasons: the analyst has
already agreed to retain the variable in the model even if the strength of the
association is very low, and the assessment of association does not reveal
the degree of nonlinearity of the predictor to allow the analyst to “tweak”
the number of knots or to discard nonlinear terms. Any predictive ability a
variable might have may be concentrated in its nonlinear effects, so using
the total association measure for a predictor to save degrees of freedom by
restricting the variable to be linear may result in no predictive ability.
Likewise, a low association measure between a categorical variable and Y might
lead the analyst to collapse some of the categories based on their frequencies.
This often helps, but sometimes the categories that are so combined are the
ones that are most different from one another. So if using partial tests or
rank correlation to reduce degrees of freedom can harm the model, one might
argue that it is fair to allow this strategy to also benefit the analysis.

When collinearities or confounding are not problematic, a quicker approach
based on pairwise measures of association can be useful. This approach will
not have numerical problems (e.g., singular covariance matrix). When Y is
binary or continuous (but not censored), a good general-purpose measure of
association that is useful in making decisions about the number of parameters
to devote to a predictor is an extension of Spearman’s $\rho$ rank correlation.
This is the ordinary $R^{2}$ from predicting the rank of Y based on the rank of
X and the square of the rank of X. This $\rho^{2}$ will detect not only nonlinear
relationships (as will ordinary Spearman $\rho$) but some non-monotonic ones
as well. It is important that the ordinary Spearman $\rho$ not be computed, as
this would tempt the analyst to simplify the regression function (towards
monotonicity) if the generalized $\rho^{2}$ does not significantly exceed the square
of the ordinary Spearman $\rho$. For categorical predictors, ranks are not squared
but instead the predictor is represented by a series of dummy variables. The
resulting $\rho^{2}$ is related to the Kruskal-Wallis test. See p. 460 for an example.
Note that bivariable correlations can be misleading if marginal relationships
vary greatly from ones obtained after adjusting for other predictors.
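The generalized $\rho^{2}$ just described is simple enough to compute directly. The sketch below is a base R illustration on simulated data. The helper gen_rho2() and all variable names are hypothetical, introduced only for this example.

```r
# Generalized Spearman rho^2: ordinary R^2 from predicting rank(Y)
# from rank(X) and rank(X)^2; categorical X enters as dummy variables instead.
gen_rho2 <- function(x, y) {
  ry <- rank(y)
  if (is.factor(x)) {
    fit <- lm(ry ~ x)                  # dummy variables, ranks are not squared
  } else {
    rx  <- rank(x)
    fit <- lm(ry ~ rx + I(rx^2))       # quadratic in the ranks
  }
  summary(fit)$r.squared
}

set.seed(2)
n       <- 300
x_u     <- runif(n, -1, 1)             # U-shaped (non-monotonic) relationship with y
x_noise <- runif(n)                    # unrelated to y
g       <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y       <- x_u^2 + 0.3 * (g == "c") + rnorm(n, sd = 0.1)

rho2 <- c(x_u     = gen_rho2(x_u, y),
          x_noise = gen_rho2(x_noise, y),
          g       = gen_rho2(g, y))
print(round(sort(rho2, decreasing = TRUE), 3))
```

In this simulation the U-shaped predictor ranks far above the noise predictor even though its relationship with the response is not monotonic, which is the situation the squared-rank term is meant to catch.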

Once one expands a predictor into linear and nonlinear terms and
estimates the coefficients, the best way to understand the relationship between
predictors and response is to graph this estimated relationship^c. If the plot
appears almost linear or the test of nonlinearity is very insignificant there
is a temptation to simplify the model. The Grambsch and O’Brien result
described in Section 2.6 demonstrates why this is a bad idea.

From the above discussion a general principle emerges. Whenever the
response variable is informally or formally linked, in an unmasked fashion, to
particular parameters that may be deleted from the model, special
adjustments must be made in P-values, standard errors, test statistics, and
confidence limits, in order for these statistics to have the correct interpretation.
Examples of strategies that are improper without special adjustments (e.g.,
using the bootstrap) include examining a frequency table or scatterplot to
decide that an association is too weak for the predictor to be included in
the model at all or to decide that the relationship appears so linear that all
nonlinear terms should be omitted. It is also valuable to consider the reverse
situation; that is, one posits a simple model and then additional analysis or
outside subject matter information makes the analyst want to generalize the
model. Once the model is generalized (e.g., nonlinear terms are added), the
test of association can be recomputed using multiple d.f. So another general
principle is that when one makes the model more complex, the d.f.
properly increases and the new test statistics for association have the claimed
distribution. Thus moving from simple to more complex models presents no
problems other than conservatism if the new complex components are truly
unnecessary.

c One can also perform a joint test of all parameters associated with nonlinear effects.
This can be useful in demonstrating to the reader that some complexity was actually
needed.


4.2  Checking  Assumptions  of  Multiple  Predictors 
Simultaneously 

Before developing a multivariable model one must decide whether the
assumptions of each continuous predictor can be verified by ignoring the effects
of all other potential predictors. In some cases, the shape of the
relationship between a predictor and the property of response will be different if an
adjustment is made for other correlated factors when deriving regression
estimates. Also, failure to adjust for an important factor can frequently alter the
nature of the distribution of Y. Occasionally, however, it is unwieldy to deal
simultaneously with all predictors at each stage in the analysis, and instead
the regression function shapes are assessed separately for each continuous
predictor.


4.3  Variable  Selection 

The material covered to this point dealt with a prespecified list of variables
to be included in the regression model. For reasons of developing a concise
model or because of a fear of collinearity or of a false belief that it is not
legitimate to include “insignificant” regression coefficients when presenting
results to the intended audience, stepwise variable selection is very commonly
employed. Variable selection is used when the analyst is faced with a series of
potential predictors but does not have (or use) the necessary subject matter
knowledge to enable her to prespecify the “important” variables to include
in the model. But using Y to compute P-values to decide which variables
to include is similar to using Y to decide how to pool treatments in a
five-treatment randomized trial, and then testing for global treatment differences
using fewer than four degrees of freedom.

Stepwise  variable  selection  has  been  a  very  popular  technique  for  many 
years,  but  if  this  procedure  had  just  been  proposed  as  a  statistical  method,  it 
would  most  likely  be  rejected  because  it  violates  every  principle  of  statistical 
estimation  and  hypothesis  testing.  Here  is  a  summary  of  the  problems  with 
this  method. 
1. It yields $R^{2}$ values that are biased high.

2. The ordinary F and $\chi^{2}$ test statistics do not have the claimed
distribution^d.234 Variable selection is based on methods (e.g., F tests for nested
models) that were intended to be used to test only prespecified hypotheses.

3. The method yields standard errors of regression coefficient estimates that
are biased low and confidence intervals for effects and predicted values that
are falsely narrow.16

4. It yields P-values that are too small (i.e., there are severe multiple
comparison problems) and that do not have the proper meaning, and the proper
correction for them is a very difficult problem.

5. It provides regression coefficients that are biased high in absolute value
and need shrinkage. Even if only a single predictor were being analyzed
and one only reported the regression coefficient for that predictor if its
association with Y were “statistically significant,” the estimate of the
regression coefficient $\hat{\beta}$ is biased (too large in absolute value). To put this
in symbols for the case where we obtain a positive association ($\hat{\beta} > 0$),
$E(\hat{\beta} \mid P < 0.05, \hat{\beta} > 0) > \beta$.100

6.  In  observational  studies,  variable  selection  to  determine  confounders  for 
adjustment  results  in  residual  confounding241. 

7.  Rather  than  solving  problems  caused  by  collinearity,  variable  selection  is 
made  arbitrary  by  collinearity. 

8.  It  allows  us  to  not  think  about  the  problem. 
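The bias described in item 5 is easy to see by simulation. The sketch below is an illustration under assumed values (a true slope of 0.2, n = 50, and 2000 replications, none of which come from the text): it compares the average slope estimate over all simulated studies with the average over only those studies in which the estimate was positive and “significant.”

```r
set.seed(3)
beta <- 0.2      # assumed true slope
n    <- 50       # assumed sample size per simulated study
nsim <- 2000

res <- t(replicate(nsim, {
  x  <- rnorm(n)
  y  <- beta * x + rnorm(n)
  co <- summary(lm(y ~ x))$coefficients["x", ]
  c(est = unname(co["Estimate"]), p = unname(co["Pr(>|t|)"]))
}))

# Report the coefficient only when it is positive and "statistically significant"
sel <- res[, "p"] < 0.05 & res[, "est"] > 0

mean_all <- mean(res[, "est"])        # essentially unbiased for beta
mean_sel <- mean(res[sel, "est"])     # estimates E(beta-hat | P < 0.05, beta-hat > 0)

print(c(true = beta, all_studies = mean_all, significant_only = mean_sel))
```

Conditioning on significance keeps only the estimates that happened to land far from zero, so their average overstates the true slope even though the unconditional average does not.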

The problems of P-value-based variable selection are exacerbated when the
analyst (as she so often does) interprets the final model as if it were
prespecified. Copas and Long125 stated one of the most serious problems with
stepwise  modeling  eloquently  when  they  said,  “The  choice  of  the  variables 
to  be  included  depends  on  estimated  regression  coefficients  rather  than  their 
true  values,  and  so  X3  is  more  likely  to  be  included  if  its  regression  coefficient 
is  over-estimated  than  if  its  regression  coefficient  is  underestimated.”  Derksen 
and  Keselman  )5  studied  stepwise  variable  selection,  backward  elimination, 
and  forward  selection,  with  these  conclusions: 

1. “The degree of correlation between the predictor variables affected the
frequency with which authentic predictor variables found their way into the
final model.

2.  The  number  of  candidate  predictor  variables  affected  the  number  of  noise 
variables  that  gained  entry  to  the  model. 

3. The size of the sample was of little practical importance in determining the
number of authentic variables contained in the final model.

4. The population multiple coefficient of determination could be faithfully
estimated by adopting a statistic that is adjusted by the total number of
candidate predictor variables rather than the number of variables in the
final model.”

d Lockhart et al.425 provide an example with n = 100 and 10 orthogonal predictors
where all true $\beta$s are zero. The test statistic for the first variable to enter has type I
error of 0.39 when the nominal $\alpha$ is set to 0.05, in line with what one would expect
with multiple testing using $1 - 0.95^{10} = 0.40$.
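The multiplicity arithmetic in footnote d can be checked numerically. The sketch below is only an approximation to that setting: it assumes independent (rather than exactly orthogonal) normal predictors, takes the first variable to enter a forward selection to be the one with the smallest single-predictor P-value, and uses 2000 simulated datasets.

```r
alpha <- 0.05; k <- 10; n <- 100
expected <- 1 - (1 - alpha)^k              # chance that at least one of k independent tests rejects

set.seed(7)
nsim <- 2000
min_p_significant <- replicate(nsim, {
  X <- matrix(rnorm(n * k), n, k)          # k noise predictors, all true slopes zero
  y <- rnorm(n)                            # response unrelated to every predictor
  r <- as.vector(cor(X, y))
  tstat <- r * sqrt((n - 2) / (1 - r^2))   # same t statistic as the slope test in lm(y ~ x)
  p <- 2 * pt(-abs(tstat), df = n - 2)
  min(p) < alpha                           # does the first variable to enter look "significant"?
})
observed <- mean(min_p_significant)

print(c(expected = expected, observed = observed))
```

The simulated proportion should fall near the 0.40 given by the formula, far above the nominal 0.05.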

They  found  that  variables  selected  for  the  final  model  represented  noise  0.20 
to  0.74  of  the  time  and  that  the  final  model  usually  contained  less  than  half 
of  the  actual  number  of  authentic  predictors.  Hence  there  are  many  reasons 
for  using  methods  such  as  full-model  fits  or  data  reduction,  instead  of  using 
any  stepwise  variable  selection  algorithm. 

If  stepwise  selection  must  be  used,  a  global  test  of  no  regression  should 
be  made  before  proceeding,  simultaneously  testing  all  candidate  predictors 
and  having  degrees  of  freedom  equal  to  the  number  of  candidate  variables 
(plus  any  nonlinear  or  interaction  terms).  If  this  global  test  is  not  significant, 
selection  of  individually  significant  predictors  is  usually  not  warranted. 
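A global test of this kind takes only a few lines. The sketch below assumes ordinary least squares, simulated data in which no predictor is related to the response, and ten linear candidate terms. With nonlinear or interaction terms among the candidates, the numerator d.f. would grow accordingly.

```r
set.seed(6)
n <- 100; p <- 10
d <- as.data.frame(matrix(rnorm(n * p), n, p))
names(d) <- paste0("x", 1:p)
d$y <- rnorm(n)                                   # simulated: no real associations

# Model containing every candidate predictor, and the intercept-only model
full <- lm(reformulate(paste0("x", 1:p), response = "y"), data = d)
null <- lm(y ~ 1, data = d)

# Global F test of no regression; numerator d.f. = number of candidate terms
glob      <- anova(null, full)
global_df <- glob$Df[2]
global_p  <- glob[["Pr(>F)"]][2]

print(c(df = global_df, p_value = global_p))
```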

The method generally used for such variable selection is forward selection
of the most significant candidate or backward elimination of the least
significant predictor in the model. One of the recommended stopping rules is
based on the “residual $\chi^{2}$” with degrees of freedom equal to the number of
candidate variables remaining at the current step. The residual $\chi^{2}$ can be
tested for significance (if one is able to forget that because of variable
selection this statistic does not have a $\chi^{2}$ distribution), or the stopping rule can
be based on Akaike’s information criterion (AIC33), here residual $\chi^{2} - 2 \times$
d.f. Of course, use of more insight from knowledge of the subject matter
will generally improve the modeling process substantially. It must be
remembered that no currently available stopping rule was developed for data-driven
variable selection. Stopping rules such as AIC or Mallows’ $C_{p}$ are intended
for comparing a limited number of prespecified models [66, Section 1.3] 34.^e

If the analyst insists on basing the stopping rule on P-values, the optimum
(in terms of predictive accuracy) $\alpha$ to use in deciding which variables to
include in the model is $\alpha = 1.0$ unless there are a few powerful variables
and several completely irrelevant variables. A reasonable $\alpha$ that does allow
for deletion of some variables is $\alpha = 0.5$.585 These values are far from the
traditional choices of $\alpha = 0.05$ or 0.10.

e AIC works successfully when the models being entertained are on a progression
defined by a single parameter, e.g. a common shrinkage coefficient or the single
number of knots to be used by all continuous predictors. AIC can also work when the
model that is best by AIC is much better than the runner-up so that if the process
were bootstrapped the same model would almost always be found. When used for
one variable at a time variable selection, AIC is just a restatement of the P-value,
and as such, doesn’t solve the severe problems with stepwise variable selection other
than forcing us to use slightly more sensible $\alpha$ values. Burnham and Anderson84
recommend selection based on AIC for a limited number of theoretically well-founded
models. Some statisticians try to deal with multiplicity problems caused by stepwise
variable selection by making $\alpha$ smaller than 0.05. This increases bias by giving
variables whose effects are estimated with error a greater relative chance of being selected.
Variable selection does not compete well with shrinkage methods that simultaneously
model all potential predictors.
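The remark that AIC restates the P-value can be made concrete with a short calculation. Under the criterion given earlier ($\chi^{2} - 2 \times$ d.f.), a candidate term is favored when its $\chi^{2}$ exceeds twice its d.f., so the implied significance level is the upper tail probability of a $\chi^{2}$ distribution at that cutoff. The object names below are hypothetical.

```r
# Significance level implied by the rule "keep the term if chi-square > 2 * d.f."
df <- 1:3
implied_alpha <- pchisq(2 * df, df = df, lower.tail = FALSE)
names(implied_alpha) <- paste0("df", df)
print(round(implied_alpha, 3))
```

For a single-d.f. term the implied $\alpha$ is about 0.157, which is larger than the traditional 0.05 but still far from the values of 0.5 to 1.0 discussed above.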
Even though forward stepwise variable selection is the most commonly
used method, the step-down method is preferred for the following reasons.

1.  It  usually  performs  better  than  forward  stepwise  methods,  especially  when 
collinearity  is  present.437 

2. It makes one examine a full model fit, which is the only fit providing
accurate standard errors, error mean square, and P-values.

3.  The  method  of  Lawless  and  Singhal385  allows  extremely  efficient  step-down 
modeling  using  Wald  statistics,  in  the  context  of  any  fit  from  least  squares 
or  maximum  likelihood.  This  method  requires  passing  through  the  data 
matrix  only  to  get  the  initial  full  fit. 

For a given dataset, bootstrapping (Efron et al.150,172,177,178) can help
decide between using full and reduced models. Bootstrapping can be done
on the whole model and compared with bootstrapped estimates of predictive
accuracy based on stepwise variable selection for each resample. Unless most
predictors are either very significant or clearly unimportant, the full model
usually outperforms the reduced model.

Full  model  fits  have  the  advantage  of  providing  meaningful  confidence 
intervals  using  standard  formulas.  Altman  and  Andersen 16  gave  an  example 
in  which  the  lengths  of  confidence  intervals  of  predicted  survival  probabilities 
were  60%  longer  when  bootstrapping  was  used  to  estimate  the  simultaneous 
effects  of  variability  caused  by  variable  selection  and  coefficient  estimation,  as 
compared  with  confidence  intervals  computed  ignoring  how  a  “final”  model 
came  to  be.  On  the  other  hand,  models  developed  on  full  fits  after  data 
reduction  will  be  optimum  in  many  cases. 

In some cases you may want to use the full model for prediction and
variable selection for a “best bet” parsimonious list of independently important
predictors. This could be accompanied by a list of variables selected in 50
bootstrap samples to demonstrate the imprecision in the “best bet.”
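A list of this kind might be produced as follows. This is a sketch on simulated data with hypothetical variable names, and it substitutes AIC-based backward elimination from the base R function step() for the Wald-based step-down approach discussed earlier.

```r
set.seed(5)
n <- 150; p <- 8
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
# Assumed truth for the simulation: x1 to x3 matter, x4 to x8 are noise
X$y <- 0.5 * X$x1 + 0.3 * X$x2 + 0.2 * X$x3 + rnorm(n)
full_formula <- reformulate(paste0("x", 1:p), response = "y")

B <- 50
selected <- matrix(FALSE, B, p, dimnames = list(NULL, paste0("x", 1:p)))
for (b in 1:B) {
  boot <- X[sample(n, replace = TRUE), ]                # bootstrap resample of rows
  fit  <- lm(full_formula, data = boot)                 # full model on the resample
  red  <- step(fit, direction = "backward", trace = 0)  # AIC-based step-down
  selected[b, ] <- colnames(selected) %in% attr(terms(red), "term.labels")
}

# Proportion of the 50 resamples in which each variable was retained
freq <- colMeans(selected)
print(round(freq, 2))
```

Selection frequencies well below 1 for real predictors, and well above 0 for noise variables, show how unstable a single “best bet” list can be.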

Sauerbrei and Schumacher541 present a method to use bootstrapping to
actually select the set of variables. However, there are a number of drawbacks
to this approach35:

1. The choice of an $\alpha$ cutoff for determining whether a variable is retained in
a given bootstrap sample is arbitrary.

2.  The  choice  of  a  cutoff  for  the  proportion  of  bootstrap  samples  for  which  a 
variable  is  retained,  in  order  to  include  that  variable  in  the  final  model,  is 
somewhat  arbitrary. 

3.  Selection  from  among  a  set  of  correlated  predictors  is  arbitrary,  and  all 
highly  correlated  predictors  may  have  a  low  bootstrap  selection  frequency. 
It  may  be  the  case  that  none  of  them  will  be  selected  for  the  final  model 
even  though  when  considered  individually  each  of  them  may  be  highly 
significant. 
4. By using the bootstrap to choose variables, one must use the double
bootstrap to resample the entire modeling process in order to validate the model
and to derive reliable confidence intervals. This may be computationally
prohibitive.

5. The bootstrap did not improve upon traditional backward stepdown
variable selection. Both methods fail at identifying the “correct” variables.

For  some  applications  the  list  of  variables  selected  may  be  stabilized  by 
grouping  variables  according  to  subject  matter  considerations  or  empirical 
correlations  and  testing  each  related  group  with  a  multiple  degree  of  freedom 
test.  Then  the  entire  group  may  be  kept  or  deleted  and,  if  desired,  groups  that 
are  retained  can  be  summarized  into  a  single  variable  or  the  most  accurately 
measured  variable  within  the  group  can  replace  the  group.  See  Section  4.7 
for  more  on  this. 

Kass and Raftery337 showed that Bayes factors have several advantages in
variable selection, including the selection of less complex models that may
agree better with subject matter knowledge. However, as in the case with
more traditional stopping rules, the final model may still have regression
coefficients that are too large. This problem is solved by Tibshirani’s lasso
method,608,609 which is a penalized estimation technique in which the
estimated regression coefficients are constrained so that the sum of their scaled
absolute values falls below some constant k chosen by cross-validation. This
kind of constraint forces some regression coefficient estimates to be exactly
zero, thus achieving variable selection while shrinking the remaining
coefficients toward zero to reflect the overfitting caused by data-based model
selection.

A final problem with variable selection is illustrated by comparing this
approach with the sensible way many economists develop regression models.
Economists frequently use the strategy of deleting only those variables
that are “insignificant” and whose regression coefficients have a nonsensible
direction. Standard variable selection on the other hand yields biologically
implausible findings in many cases by setting certain regression coefficients
exactly to zero. In a study of survival time for patients with heart failure,
for example, it would be implausible that patients having a specific symptom
live exactly as long as those without the symptom just because the symptom’s
regression coefficient was “insignificant.” The lasso method shares this
difficulty with ordinary variable selection methods and with any method that
in the Bayesian context places nonzero prior probability on $\beta$ being
exactly zero.

Many papers claim that there were insufficient data to allow for
multivariable modeling, so they did “univariable screening” wherein only
“significant” variables (i.e., those that are separately significantly
associated with Y) were entered into the model. This is just a forward
stepwise variable selection in which insignificant variables from the first
step are not reanalyzed in later steps. Univariable screening is thus even
worse than stepwise modeling as it can miss important variables that are only
important after adjusting for other variables [598]. Overall, neither
univariable screening nor stepwise variable selection in any way solves the
problem of “too many variables, too few subjects,” and they cause severe
biases in the resulting multivariable model fits while losing valuable
predictive information from deleting marginally significant variables.

f This is akin to doing a t-test to compare the two treatments (out of 10, say)
that are apparently most different from each other.

The  online  course  notes  contain  a  simple  simulation  study  of  stepwise 
selection  using  R. 
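
As a rough illustration of the problem (a minimal sketch, not the simulation
from the course notes), the R code below applies backward stepdown selection
to candidate predictors that are pure noise. The sample size, the number of
candidates, and the use of the AIC-based step() function are assumptions made
only for this example.

```r
# Backward stepdown on pure noise: every true regression coefficient is zero.
set.seed(1)
n <- 100                                   # hypothetical sample size
p <- 20                                    # hypothetical number of candidates
d <- as.data.frame(matrix(rnorm(n * p), n, p))
d$y <- rnorm(n)                            # response unrelated to all predictors

full <- lm(y ~ ., data = d)
sel  <- step(full, direction = "backward", trace = 0)  # AIC-based stepdown

n_selected  <- length(coef(sel)) - 1       # number of predictors retained
r2_apparent <- summary(sel)$r.squared      # apparent R^2 of the "final" model
print(n_selected)
print(round(r2_apparent, 3))
```

Because the true $R^2$ is zero here, any retained predictor is a false
positive and any apparent $R^2$ above zero is noise.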

When a model is fitted that is too complex, that is, has too many free
parameters to estimate for the amount of information in the data, the worth
of the model (e.g., $R^2$) will be exaggerated and future observed values will
not agree with predicted values. In this situation, overfitting is said to be
present, and some of the findings of the analysis come from fitting noise and
not just signal, or finding spurious associations between X and Y. In this
section general guidelines for preventing overfitting are given. Here we
concern ourselves with the reliability or calibration of a model, meaning the
ability of the model to predict future observations as well as it appeared to
predict the responses at hand. For now we avoid judging whether the model is
adequate for the task, but restrict our attention to the likelihood that the
model has significantly overfitted the data.

In typical low signal-to-noise ratio situations^g, model validations on
independent datasets have found the minimum training sample size for which
the fitted model has an independently validated predictive discrimination
that equals the apparent discrimination seen in the training sample. Similar
validation experiments have considered the margin of error in estimating an
absolute quantity such as event probability. Studies such as [268, 270, 577]
have shown that in many situations a fitted regression model is likely to be
reliable when the number of predictors (or candidate predictors if using
variable selection) $p$ is less than $m/10$ or $m/20$, where $m$ is the
“limiting sample size” given in Table 4.1. A good average requirement is
$p < m/15$. For example, Smith et al. found in one series of simulations that
the expected error in Cox model predicted five-year survival probabilities
was below 0.05 when $p < m/20$ for “average” subjects and below 0.10 when
$p < m/20$ for “sick” subjects, where $m$ is the number of deaths. For
“average” subjects, $m/10$ was adequate for preventing expected errors > 0.1.
Note: The number of non-intercept parameters in the model ($p$) is usually
greater than the number of predictors. Narrowly distributed predictor
variables (e.g., if all subjects’ ages are between 30 and 45 or only 5% of
subjects are female) will require even higher sample sizes. Note that the
number of candidate variables must include all variables screened for
association with the response, including nonlinear terms and interactions.
Instead of relying on the rules of thumb in the table, the shrinkage factor
estimate presented in the next section can be used to guide the analyst in
determining how many d.f. to model (see p. 87).

Table 4.1 Limiting Sample Sizes for Various Response Variables

Type of Response Variable     Limiting Sample Size m
Continuous                    $n$ (total sample size)
Binary                        $\min(n_1, n_2)$ ^h
Ordinal (k categories)        $n - \frac{1}{n^2} \sum_{i=1}^{k} n_i^3$ ^i
Failure (survival) time       number of failures ^j

g These are situations where the true $R^2$ is low, unlike tightly controlled
experiments and mechanistic models where signal:noise ratios can be quite
high. In those situations, many parameters can be estimated from small
samples, and the $m/15$ rule of thumb can be significantly relaxed.
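
The limiting sample sizes in Table 4.1 and the average $p < m/15$ requirement
can be turned into a small calculator. The sketch below is illustrative only:
the function names limiting_m and max_p and their arguments are hypothetical,
and the ordinal formula is the one shown in the table.

```r
# Limiting sample size m for each response type in Table 4.1.
# 'counts' holds the marginal frequencies of the response categories.
limiting_m <- function(type = c("continuous", "binary", "ordinal", "survival"),
                       n = NULL, counts = NULL, events = NULL) {
  type <- match.arg(type)
  switch(type,
    continuous = n,                       # total sample size
    binary     = min(counts),             # smaller of the two outcome frequencies
    ordinal    = sum(counts) - sum(counts^3) / sum(counts)^2,
    survival   = events)                  # number of failures
}

# Approximate ceiling on the number of candidate parameters p (m/15 rule).
max_p <- function(m, ratio = 15) floor(m / ratio)

m_bin <- limiting_m("binary", counts = c(30, 270))   # 30 events, 270 non-events
m_ord <- limiting_m("ordinal", counts = rep(20, 5))  # five equal categories
print(c(m_bin, max_p(m_bin)))
print(m_ord)
```

With five equally frequent categories and $n = 100$ the ordinal limiting
sample size is $100(1 - 1/5^2) = 96$, in line with the relative efficiency
quoted in footnote i.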

h See [487]. If one considers the power of a two-sample binomial test compared
with a Wilcoxon test if the response could be made continuous and the
proportional odds assumption holds, the effective sample size for a binary
response is $3 n_1 n_2 / n \approx 3 \min(n_1, n_2)$ if $n_1/n$ is near 0 or 1
[664, Eq. 10, 15]. Here $n_1$ and $n_2$ are the marginal frequencies of the
two response levels.

i Based on the power of a proportional odds model two-sample test when the
marginal cell sizes for the response are $n_1, \ldots, n_k$, compared with all
cell sizes equal to unity (response is continuous) [664, Eq. 3]. If all cell
sizes are equal, the relative efficiency of having $k$ response categories
compared with a continuous response is $1 - 1/k^2$ [664, Eq. 14]; for example,
a five-level response is almost as efficient as a continuous one if
proportional odds holds across category cutoffs.

j This is approximate, as the effective sample size may sometimes be boosted
somewhat by censored observations, especially for non-proportional hazards
methods such as Wilcoxon-type tests [49].

Rules of thumb such as the 15:1 rule do not consider that a certain minimum
sample size is needed just to estimate basic parameters such as an intercept
or residual variance. This is dealt with in upcoming topics about specific
models. For the case of ordinary linear regression, estimation of the
residual variance is central. All standard errors, P-values, confidence
intervals, and $R^2$ depend on having a precise estimate of $\sigma^2$. The
one-sample problem of estimating a mean, which is equivalent to a linear
model containing only an intercept, is the easiest case when estimating
$\sigma^2$. When a sample of size $n$ is drawn from a normal distribution, a
$1 - \alpha$ two-sided confidence interval for the unknown population variance
$\sigma^2$ is given by

$$\frac{n-1}{\chi^2_{1-\alpha/2,\,n-1}}\, s^2 < \sigma^2 < \frac{n-1}{\chi^2_{\alpha/2,\,n-1}}\, s^2, \tag{4.1}$$

where $s^2$ is the sample variance and $\chi^2_{\alpha,\,n-1}$ is the $\alpha$
critical value of the $\chi^2$ distribution with $n - 1$ degrees of freedom.
We take the fold-change or multiplicative margin of error (MMOE) for
estimating $\sigma$ to be

$$\sqrt{\max\left(\frac{\chi^2_{1-\alpha/2,\,n-1}}{n-1},\ \frac{n-1}{\chi^2_{\alpha/2,\,n-1}}\right)}. \tag{4.2}$$

To achieve a MMOE of no worse than 1.2 with 0.95 confidence when estimating
$\sigma$ requires a sample size of 70 subjects.
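
The multiplicative margin of error for $\sigma$ can be checked numerically
from the $\chi^2$ quantiles. This is a minimal sketch in base R; the function
name mmoe and its arguments are hypothetical.

```r
# Multiplicative margin of error for estimating sigma from a normal sample
# of size n: the larger of the upward and downward fold-changes implied by
# the chi-square confidence limits for the variance.
mmoe <- function(n, conf = 0.95) {
  alpha <- 1 - conf
  df <- n - 1
  sqrt(pmax(qchisq(1 - alpha / 2, df) / df,
            df / qchisq(alpha / 2, df)))
}

print(round(mmoe(c(20, 50, 70, 100, 200)), 3))
```

The value for $n = 70$ should be close to 1.2, and the margin shrinks slowly
as $n$ grows.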

The linear model case is useful for examining the $n : p$ ratio another way.
As discussed in the next section, $R^2_{\rm adj}$ is a nearly unbiased
estimate of $R^2$, i.e., is not inflated by overfitting if the value used for
$p$ is “honest”, i.e., includes all variables screened. We can ask the
question “for a given $R^2$, what ratio of $n : p$ is required so that
$R^2_{\rm adj}$ does not drop by more than a certain relative or absolute
amount from the value of $R^2$?” This assessment takes into account that
higher signal:noise ratios allow fitting more variables. For example, with
low $R^2$ a 100:1 ratio of $n : p$ may be required to prevent $R^2_{\rm adj}$
from dropping by more than 1/10 or by an absolute amount of 0.01. A 15:1 rule
would prevent $R^2_{\rm adj}$ from dropping by more than 0.075 for low $R^2$
(Figure 4.1).

Fig. 4.1 Multiple of $p$ that $n$ must be to achieve a relative drop from
$R^2$ to $R^2_{\rm adj}$ by the indicated relative factor (left panel, 3
factors) or absolute difference (right panel, 6 decrements)
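
The required multiple of $p$ can be approximated directly from the
definition of adjusted $R^2$. The sketch below assumes $n - 1 \approx n$ and
$n - p - 1 \approx n - p$, so that $R^2 - R^2_{\rm adj} \approx (1 - R^2)/(k - 1)$
with $k = n/p$; solving for $k$ gives the two expressions in the code. The
function name np_multiple is hypothetical, and the approximation is an
assumption of this example rather than a statement of how Figure 4.1 was
drawn.

```r
# Approximate multiple k = n/p needed so that adjusted R^2 stays within a
# relative factor of R^2 (R2adj = relative * R2) or within an absolute
# difference of R^2 (R2 - R2adj = absolute).
np_multiple <- function(R2, relative = NULL, absolute = NULL) {
  if (!is.null(relative)) {
    ((1 / R2) - relative) / (1 - relative)
  } else {
    (1 - R2 + absolute) / absolute
  }
}

print(np_multiple(0.10, relative = 0.9))     # low R^2, drop of at most 1/10
print(np_multiple(0.01, absolute = 0.01))    # low R^2, absolute drop of 0.01
print(np_multiple(0.01, absolute = 0.075))   # close to the 15:1 rule
```

Under these assumptions a low $R^2$ calls for roughly 90 to 100 observations
per parameter for the stricter criteria, while an absolute drop of 0.075
corresponds to a multiple near 15.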


4.5  Shrinkage 



The term shrinkage is used in regression modeling to denote two ideas. The
first meaning relates to the slope of a calibration plot, which is a plot of
observed responses against predicted responses.^k When a dataset is used to
fit the model parameters as well as to obtain the calibration plot, the usual
estimation process will force the slope of observed versus predicted values to
be one. When, however, parameter estimates are derived from one dataset
and then applied to predict outcomes on an independent dataset, overfitting
will cause the slope of the calibration plot (i.e., the shrinkage factor) to
be less than one, a result of regression to the mean. Typically, low
predictions will be too low and high predictions too high. Predictions near
the mean predicted value will usually be quite accurate. The second meaning of
shrinkage is a statistical estimation method that preshrinks regression
coefficients towards zero so that the calibration plot for new data will not
need shrinkage as its calibration slope will be one.

We  turn  first  to  shrinkage  as  an  adverse  result  of  traditional  modeling. 
In  ordinary  linear  regression,  we  know  that  all  of  the  coefficient  estimates 
are  exactly  unbiased  estimates  of  the  true  effect  when  the  model  fits.  Isn’t 
the  existence  of  shrinkage  and  overfitting  implying  that  there  is  some  kind 
of  bias  in  the  parameter  estimates?  The  answer  is  no  because  each  separate 
coefficient  has  the  desired  expectation.  The  problem  lies  in  how  we  use  the 
coefficients.  We  tend  not  to  pick  out  coefficients  at  random  for  interpretation 
but  we  tend  to  highlight  very  small  and  very  large  coefficients. 

A  simple  example  may  suffice.  Consider  a  clinical  trial  with  10  randomly 
assigned  treatments  such  that  the  patient  responses  for  each  treatment  are 
normally distributed. We can do an ANOVA by fitting a multiple regression
model with an intercept and nine dummy variables. The intercept is an
unbiased estimate of the mean response for patients on the first treatment,
and each of the other coefficients is an unbiased estimate of the difference
in mean response between the treatment in question and the first treatment.
$\hat\beta_0 + \hat\beta_1$ is an unbiased estimate of the mean response for
patients on the second treatment. But if we plotted the predicted mean
response for patients
against  the  observed  responses  from  new  data,  the  slope  of  this  calibration 
plot  would  typically  be  smaller  than  one.  This  is  because  in  making  this  plot 
we  are  not  picking  coefficients  at  random  but  we  are  sorting  the  coefficients 
into  ascending  order.  The  treatment  group  having  the  lowest  sample  mean 
response  will  usually  have  a  higher  mean  in  the  future,  and  the  treatment 
group  having  the  highest  sample  mean  response  will  typically  have  a  lower 
mean  in  the  future.  The  sample  mean  of  the  group  having  the  highest  sample 
mean  is  not  an  unbiased  estimate  of  its  population  mean. 


k  An  even  more  stringent  assessment  is  obtained  by  stratifying  calibration  curves  by 
predictor  settings. 

As an illustration, let us draw 20 samples of size $n = 50$ from a uniform
distribution for which the true mean is 0.5. Figure 4.2 displays the 20 means
sorted into ascending order, similar to plotting $Y$ versus
$\hat{Y} = X\hat\beta$ based on least squares after sorting by $X\hat\beta$.
Bias in the very lowest and highest estimates is evident.


set.seed(123)
n <- 50
y <- runif(20 * n)                    # 20 samples of size n from uniform [0, 1]
group <- rep(1:20, each=n)
ybar <- tapply(y, group, mean)        # sample mean of each group
ybar <- sort(ybar)                    # ascending order; names keep group labels
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
     xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
# label each tick with the original group number
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
abline(h=.5, col=gray(.85))           # reference line at the true mean

[Figure 4.2: group means (y-axis “Group Mean”, 0.3 to 0.7) plotted against
group (x-axis “Group”), with groups in ascending order of sample mean:
16, 6, 17, 2, 10, 14, 20, 9, 8, 7, 11, 18, 5, 4, 3, 1, 15, 13, 19, 12.]

Fig. 4.2 Sorted means from 20 samples of size 50 from a uniform [0, 1]
distribution. The reference line at 0.5 depicts the true population value of
all of the means.


When we want to highlight a treatment that is not chosen at random (or a
priori), the data-based selection of that treatment needs to be compensated
for in the estimation process.^l It is well known that the use of shrinkage
methods such as the James-Stein estimator to pull treatment means toward
the grand mean over all treatments results in estimates of treatment-specific
means that are far superior to ordinary stratified means.1'6

l It is interesting that researchers are quite comfortable with adjusting
P-values for post hoc selection of comparisons using, for example, the
Bonferroni inequality, but they do not realize that post hoc selection of
comparisons also biases point estimates.

Turning from a cell means model to the general case where predicted values
are general linear combinations $X\hat\beta$, the slope $\gamma$ of properly
transformed responses $Y$ against $X\hat\beta$ (sorted into ascending order)
will be less than one on new data. Estimation of the shrinkage coefficient
$\gamma$ allows quantification of the amount of overfitting present, and it
allows one to estimate the likelihood that the model will reliably predict
new observations. van Houwelingen and le Cessie [633, Eq. 77] provided a
heuristic shrinkage estimate that has worked well in several examples:

$$\hat\gamma = \frac{\text{model } \chi^2 - p}{\text{model } \chi^2}, \tag{4.3}$$

where $p$ is the total degrees of freedom for the predictors and model
$\chi^2$ is the likelihood ratio $\chi^2$ statistic for testing the joint
influence of all predictors simultaneously (see Section 9.3.1). For ordinary
linear models, van Houwelingen and le Cessie proposed a shrinkage factor
$\hat\gamma$ that can be shown to equal
$\frac{n-p-1}{n-1}\,\frac{R^2_{\rm adj}}{R^2}$, where the adjusted $R^2$ is
given by

$$R^2_{\rm adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}. \tag{4.4}$$

For such linear models with an intercept $\beta_0$, the shrunken estimate of
$\beta$ is

$$\hat\beta_0^{s} = (1 - \hat\gamma)\,\bar{Y} + \hat\gamma\,\hat\beta_0$$
$$\hat\beta_j^{s} = \hat\gamma\,\hat\beta_j, \quad j = 1, \ldots, p, \tag{4.5}$$

where $\bar{Y}$ is the mean of the response vector. Again, when stepwise
fitting is used, the $p$ in these equations is much closer to the number of
candidate degrees of freedom rather than the number in the “final” model. See
Section 5.3 for methods of estimating $\gamma$ using the bootstrap (p. 115) or
cross-validation.
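
The heuristic shrinkage estimates and the shrunken coefficients can be
computed from an ordinary lm() fit. The sketch below uses simulated data with
made-up coefficients; for a Gaussian linear model the likelihood ratio
$\chi^2$ is taken to be $-n \log(1 - R^2)$. All object names are
hypothetical.

```r
set.seed(2)
n <- 60
p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(0.5, 0.3, rep(0, p - 2))) + rnorm(n)   # hypothetical truth

fit <- lm(y ~ X)
s   <- summary(fit)

# Linear-model form: (n - p - 1)/(n - 1) * R2_adj / R2
gamma_lm <- (n - p - 1) / (n - 1) * s$adj.r.squared / s$r.squared

# Likelihood ratio form: (model chi-square - p) / model chi-square
lr_chi2  <- -n * log(1 - s$r.squared)
gamma_lr <- (lr_chi2 - p) / lr_chi2

# Shrunken intercept and slopes using the linear-model shrinkage factor
b  <- coef(fit)
bs <- c((1 - gamma_lm) * mean(y) + gamma_lm * b[1], gamma_lm * b[-1])

pred_shrunk <- drop(cbind(1, X) %*% bs)
print(round(c(gamma_lm = gamma_lm, gamma_lr = gamma_lr), 3))
```

The shrunken intercept is constructed so that the mean of the shrunken
predictions still equals $\bar{Y}$; only the spread of the predictions around
that mean is reduced.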

Now turn to the second usage of the term shrinkage. Just as clothing is
sometimes preshrunk so that it will not shrink further once it is purchased,
better calibrated predictions result when shrinkage is built into the
estimation process in the first place. The object of shrinking regression
coefficient estimates is to obtain a shrinkage coefficient of $\gamma = 1$ on
new data. Thus by somewhat discounting $\hat\beta$ we make the model
underfitted on the data at hand (i.e., apparent $\gamma < 1$) so that on new
data extremely low or high predictions are correct.

Ridge regression [388, 633] is one technique for placing restrictions on the
parameter estimates that results in shrinkage. A ridge parameter must be
chosen to control the amount of shrinkage. Penalized maximum likelihood
estimation [237, 272, 388, 639], a generalization of ridge regression, is a
general shrinkage procedure. A method such as cross-validation or optimization
of a modified AIC must be used to choose an optimal penalty factor. An
advantage of penalized estimation is that one can differentially penalize the
more complex components of the model such as nonlinear or interaction effects.
A drawback of ridge regression and penalized maximum likelihood is that the
final model is difficult to validate unbiasedly since the optimal amount of
shrinkage is usually determined by examining the entire dataset. Penalization
is one of the best ways to approach the “too many variables, too little data”
problem. See Section 9.10 for details.
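
To make the idea of a ridge parameter concrete, the following base R sketch
computes ridge estimates in closed form on standardized predictors. It is an
illustration only: the function name ridge_coef, the simulated data, and the
penalty value 10 are assumptions, and no attempt is made here to choose the
penalty by cross-validation or a modified AIC.

```r
# Ridge estimates on standardized predictors with a centered response:
# solve (X'X + lambda I) b = X'y.
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                  # standardize so the penalty treats columns alike
  yc <- y - mean(y)               # centering removes the unpenalized intercept
  drop(solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc)))
}

set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # highly correlated with x1
X  <- cbind(x1, x2, x3 = rnorm(n))
y  <- x1 + x2 + rnorm(n)

b0  <- ridge_coef(X, y, lambda = 0)    # no penalty: least squares on scaled X
b10 <- ridge_coef(X, y, lambda = 10)   # arbitrary penalty for illustration
print(round(rbind(b0, b10), 3))
```

Increasing the penalty pulls the coefficient vector toward zero, which is the
"preshrinking" described above.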


4.6  Collinearity 

When  at  least  one  of  the  predictors  can  be  predicted  well  from  the  other 
predictors,  the  standard  errors  of  the  regression  coefficient  estimates  can  be 
inflated  and  corresponding  tests  have  reduced  power.21'  In  stepwise  variable 
selection,  collinearity  can  cause  predictors  to  compete  and  make  the  selection 
of  “important”  variables  arbitrary.  Collinearity  makes  it  difficult  to  estimate 
and  interpret  a  particular  regression  coefficient  because  the  data  have  little 
information  about  the  effect  of  changing  one  variable  while  holding  another 
(highly  correlated)  variable  constant  [101,  Chap.  9].  However,  collinearity 
does  not  affect  the  joint  influence  of  highly  correlated  variables  when  tested 
simultaneously.  Therefore,  once  groups  of  highly  correlated  predictors  are 
identified,  the  problem  can  be  rectified  by  testing  the  contribution  of  an 
entire  set  with  a  multiple  d.f.  test  rather  than  attempting  to  interpret  the 
coefficient  or  one  d.f.  test  for  a  single  predictor. 

Collinearity  does  not  affect  predictions  made  on  the  same  dataset  used  to 
estimate  the  model  parameters  or  on  new  data  that  have  the  same  degree 
of  collinearity  as  the  original  data  [470,  pp.  379-381]  as  long  as  extreme 
extrapolation  is  not  attempted.  Consider  as  two  predictors  the  total  and  LDL 
cholesterols  that  are  highly  correlated.  If  predictions  are  made  at  the  same 
combinations  of  total  and  LDL  cholesterol  that  occurred  in  the  training  data, 
no  problem  will  arise.  However,  if  one  makes  a  prediction  at  an  inconsistent 
combination  of  these  two  variables,  the  predictions  may  be  inaccurate  and 
have  high  standard  errors. 

When the ordinary truncated power basis is used to derive component
variables for fitting linear and cubic splines, as was described earlier, the
component variables can be very collinear. It is very unlikely that this will
result in any problems, however, as the component variables are connected
algebraically. Thus it is not possible for a combination of, for example, $x$
and $\max(x - 10, 0)$ to be inconsistent with each other. Collinearity
problems are then more likely to result from partially redundant subsets of
predictors as in the cholesterol example above.

One way to quantify collinearity is with variance inflation factors or VIF,
which in ordinary least squares are diagonals of the inverse of the $X'X$
matrix scaled to have unit variance (except that a column of 1s is retained
corresponding to the intercept). Note that some authors compute VIF from the
correlation matrix form of the design matrix, omitting the intercept.
$\mathrm{VIF}_i$ is $1/(1 - R_i^2)$ where $R_i^2$ is the squared multiple
correlation coefficient between column $i$ and the remaining columns of the
design matrix. For models that are fitted with maximum likelihood estimation,
the information matrix is scaled to correlation form, and VIF is the diagonal
of the inverse of this scaled matrix [147, 654]. Then the VIF are similar to
those from a weighted correlation matrix of the original columns in the design
matrix. Note that indexes such as VIF are not very informative as some
variables are algebraically connected to each other.
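
The two descriptions of VIF can be compared numerically. This sketch uses the
correlation-matrix form (intercept omitted) on simulated predictors and checks
it against $1/(1 - R_i^2)$ from regressing each column on the others; the data
and object names are made up for the example.

```r
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)     # strongly related to x1
x3 <- rnorm(n)                    # unrelated to the others
X  <- cbind(x1, x2, x3)

# Diagonal of the inverse correlation matrix
vif_inv <- diag(solve(cor(X)))

# 1 / (1 - R_i^2), regressing each column on the remaining columns
vif_r2 <- sapply(seq_len(ncol(X)), function(i) {
  r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
  1 / (1 - r2)
})

print(round(rbind(vif_inv, vif_r2), 3))
```

The two rows agree, with large values for the correlated pair and a value
near one for the unrelated predictor.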

The SAS VARCLUS procedure [539] and R varclus function can identify collinear
predictors. Summarizing collinear variables using a summary score is more
powerful and stable than arbitrary selection of one variable in a group of
collinear variables (see the next section).

4.7 Data Reduction

The sample size need not be as large as shown in Table 4.1 if the model
is to be validated independently and if you don’t care that the model may
fail to validate. However, it is likely that the model will be overfitted and
will not validate if the sample size does not meet the guidelines. Use of data
reduction methods before model development is strongly recommended if the
conditions in Table 4.1 are not satisfied, and if shrinkage is not
incorporated into parameter estimation. Methods such as shrinkage and data
reduction reduce the effective d.f. of the model, making it more likely for
the model to validate on future data. Data reduction is aimed at reducing the
number of parameters to estimate in the model, without distorting statistical
inference for the parameters. This is accomplished by ignoring Y during data
reduction. Manipulations of X in unsupervised learning may result in a loss
of information for predicting Y, but when the information loss is small, the
gain in power and reduction of overfitting more than offset the loss.

Some  available  data  reduction  methods  are  given  below. 

1.  Use  the  literature  to  eliminate  unimportant  variables. 

2.  Eliminate  variables  whose  distributions  are  too  narrow. 

3. Eliminate candidate predictors that are missing in a large number of
subjects, especially if those same predictors are likely to be missing for
future applications of the model.

4. Use a statistical data reduction method such as incomplete principal
component regression, nonlinear generalizations of principal components such
as principal surfaces, sliced inverse regression, variable clustering, or
ordinary cluster analysis on a measure of similarity between variables.

See Chapters 8 and 14 for detailed case studies in data reduction.

4.7.1 Redundancy Analysis

There are many approaches to data reduction. One rigorous approach involves
removing predictors that are easily predicted from other predictors, using
flexible parametric additive regression models. This approach is unlikely to
result in a major reduction in the number of regression coefficients to
estimate against Y, but will usually provide insights useful for later data
reduction over and above the insights given by methods based on pairwise
correlations instead of multiple $R^2$.

The  Hmisc  redun  function  implements  the  following  redundancy  checking 
algorithm. 

• Expand each continuous predictor into restricted cubic spline basis
functions. Expand categorical predictors into dummy variables.

• Use OLS to predict each predictor with all component terms of all
remaining predictors (similar to what the Hmisc transcan function does). When
the predictor is expanded into multiple terms, use the first canonical
variate.^m

• Remove the predictor that can be predicted from the remaining set with
the highest adjusted or regular $R^2$.

• Predict all remaining predictors from their complement.

• Continue in like fashion until no variable still in the list of predictors
can be predicted with an $R^2$ or adjusted $R^2$ greater than a specified
threshold or until dropping the variable with the highest $R^2$ (adjusted or
ordinary) would cause a variable that was dropped earlier to no longer be
predicted at the threshold from the now smaller list of predictors.
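
The flavor of the algorithm above can be conveyed with a much simplified
sketch in base R. This is not the Hmisc redun implementation: it assumes all
predictors are continuous and enter linearly (no spline expansion, no
canonical variates), uses ordinary $R^2$, and omits the final check on
previously dropped variables. The function name redun_sketch and the
threshold 0.9 are hypothetical.

```r
# Repeatedly drop the predictor that is best predicted (highest R^2) from
# the predictors still in the list, until no R^2 reaches the threshold.
redun_sketch <- function(X, r2_cut = 0.9) {
  keep    <- colnames(X)
  dropped <- character(0)
  repeat {
    if (length(keep) < 2) break
    r2 <- sapply(keep, function(v) {
      others <- setdiff(keep, v)
      summary(lm(X[[v]] ~ ., data = X[others]))$r.squared
    })
    if (max(r2) < r2_cut) break
    worst   <- names(which.max(r2))   # most redundant predictor
    dropped <- c(dropped, worst)
    keep    <- setdiff(keep, worst)
  }
  list(kept = keep, dropped = dropped)
}

set.seed(7)
n  <- 200
x1 <- rnorm(n)
d  <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(n, sd = 0.1),   # nearly a copy of x1
                 x3 = rnorm(n))                  # unrelated predictor
res <- redun_sketch(d, r2_cut = 0.9)
print(res)
```

In this simulated example one member of the nearly duplicated pair is flagged
as redundant and the unrelated predictor is retained.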

Special  consideration  must  be  given  to  categorical  predictors.  One  way  to 
consider  a  categorical  variable  redundant  is  if  a  linear  combination  of  dummy 
variables  representing  it  can  be  predicted  from  a  linear  combination  of  other 
variables.  For  example,  if  there  were  4  cities  in  the  data  and  each  city’s  rainfall 
was  also  present  as  a  variable,  with  virtually  the  same  rainfall  reported  for 
all  observations  for  a  city,  city  would  be  redundant  given  rainfall  (or  vice- 
versa).  If  two  cities  had  the  same  rainfall,  ‘city’  might  be  declared  redundant 
even  though  tied  cities  might  be  deemed  non-redundant  in  another  setting.  A 
second,  more  stringent  way  to  check  for  redundancy  of  a  categorical  predictor 
is  to  ascertain  whether  all  dummy  variables  created  from  the  predictor  are 
individually  redundant.  The  redun  function  implements  both  approaches. 
Examples  of  use  of  redun  are  given  in  two  case  studies. 

m  There  is  an  option  to  force  continuous  variables  to  be  linear  when  they  are  being 
predicted. 

4.7.2 Variable Clustering


Although the use of subject matter knowledge is usually preferred, statistical
clustering techniques can be useful in determining independent dimensions
that are described by the entire list of candidate predictors. Once each
dimension is scored (see below), the task of regression modeling is
simplified, and one quits trying to separate the effects of factors that are
measuring the same phenomenon. One type of variable clustering [53] is based
on a type of oblique-rotation principal component (PC) analysis that attempts
to separate variables so that the first PC of each group is representative of
that group (the first PC is the linear combination of variables having maximum
variance subject to normalization constraints on the coefficients [142, 144]).
Another approach, that of doing a hierarchical cluster analysis on an
appropriate similarity matrix (such as squared correlations) will often yield
the same results. For either approach, it is often advisable to use robust
(e.g., rank-based) measures for continuous variables if they are skewed, as
skewed variables can greatly affect ordinary correlation coefficients.
Pairwise deletion of missing values is also advisable for this procedure;
casewise deletion can result in a small biased sample.

When variables are not monotonically related to each other, Pearson or
Spearman squared correlations can miss important associations and thus are
not always good similarity measures. A general and robust similarity measure
is Hoeffding’s D [295], which for two variables X and Y is a measure of
the agreement between $F(x, y)$ and $G(x)H(y)$, where $G, H$ are marginal
cumulative distribution functions and $F$ is the joint CDF. The D statistic
will detect a wide variety of dependencies between two variables.

See  pp.  330  and  458  for  examples  of  variable  clustering. 
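
The hierarchical clustering approach on squared rank correlations can be
sketched with base R alone. The simulated variables, the choice of complete
linkage, and the cut into two clusters are assumptions made for this example;
the Hmisc varclus function mentioned earlier provides a packaged version of
this kind of analysis.

```r
set.seed(5)
n <- 150
a <- rnorm(n)                      # first latent dimension
b <- rnorm(n)                      # second latent dimension
X <- cbind(a1 = a + rnorm(n, sd = 0.3),
           a2 = exp(a + rnorm(n, sd = 0.3)),   # skewed, monotone in a
           b1 = b + rnorm(n, sd = 0.3),
           b2 = b + rnorm(n, sd = 0.3))

# Similarity: squared Spearman correlation with pairwise deletion of NAs
sim <- cor(X, method = "spearman", use = "pairwise.complete.obs")^2

# Cluster on dissimilarity 1 - similarity
hc       <- hclust(as.dist(1 - sim), method = "complete")
clusters <- cutree(hc, k = 2)
print(clusters)
```

Because the rank-based similarity is unaffected by the monotone but skewed
transformation of a2, the variables driven by the same latent dimension fall
into the same cluster.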

4.7.3 Transformation and Scaling Variables Without Using Y

Scaling  techniques  often  allow  the  analyst  to  reduce  the  number  of  parameters 
to  fit  by  estimating  transformations  for  each  predictor  using  only  information 
about associations with other predictors. It may be advisable to cluster
variables before scaling so that patterns are derived only from variables that are
related. For purely categorical predictors, methods such as correspondence
analysis (see, for example, [108,139,239,391,456]) can be useful for data
reduction. Often one can use these techniques to scale multiple dummy variables
into a few dimensions. For mixtures of categorical and continuous predictors,
qualitative principal component analysis such as the maximum total variance
(MTV) method of Young et al. [456,681] is useful. For the special case of
representing a series of variables with one PC, the MTV method is quite easy to
implement.

1. Compute $PC_1$, the first PC of the variables to reduce $X_1, \ldots, X_q$, using
the correlation matrix of Xs.

2. Use ordinary linear regression to predict $PC_1$ on the basis of functions of
the Xs, such as restricted cubic spline functions for continuous Xs or a
series of dummy variables for polytomous Xs. The expansion of each $X_j$
is regressed separately on $PC_1$.

3. These separately fitted regressions specify the working transformations of
each X.

4. Recompute $PC_1$ by doing a PC analysis on the transformed Xs (predicted
values from the fits).

5. Repeat steps 2 to 4 until the proportion of variation explained by $PC_1$
reaches a plateau. This typically requires three to four iterations.
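
The five steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions: the data are simulated, the helper pc1_of is hypothetical, and natural splines from the splines package (ns with 3 d.f.) stand in for restricted cubic splines. It is not the implementation used by any particular package.

```r
library(splines)   # ns(): natural (restricted) cubic spline basis

set.seed(2)
n <- 300
z <- rnorm(n)                                   # shared latent dimension
X <- data.frame(x1 = z + rnorm(n, sd = 0.5),
                x2 = (z + rnorm(n, sd = 0.5))^3,   # nonlinearly related
                x3 = exp(z + rnorm(n, sd = 0.5)))

# Hypothetical helper: first PC from the correlation matrix, plus the
# proportion of variation it explains
pc1_of <- function(M) {
  p <- prcomp(M, center = TRUE, scale. = TRUE)
  list(score = p$x[, 1], prop = p$sdev[1]^2 / sum(p$sdev^2))
}

cur  <- pc1_of(X)                               # step 1
prop <- cur$prop
for (iter in 1:4) {                             # step 5: a few iterations
  # steps 2-3: regress PC1 on a spline expansion of each X separately;
  # the fitted values are the working transformations
  Xt <- as.data.frame(lapply(X, function(x)
          fitted(lm(cur$score ~ ns(x, df = 3)))))
  cur  <- pc1_of(Xt)                            # step 4
  prop <- c(prop, cur$prop)
}
round(prop, 3)   # proportion explained by PC1: raw data, then each iteration
```

In practice one would stop when successive values of prop differ by less than some small tolerance rather than after a fixed number of passes.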

A transformation procedure that is similar to MTV is the maximum
generalized variance (MGV) method due to Sarle [368, pp. 1267-1268]. MGV
involves predicting each variable from (the current transformations of) all
the other variables. When predicting variable i, that variable is represented
as a set of linear and nonlinear terms (e.g., spline components). Analysis of
canonical variates [279] can be used to find the linear combination of terms for
$X_i$ (i.e., find a new transformation for $X_i$) and the linear combination of the
current transformations of all other variables (representing each variable as
a single, transformed, variable) such that these two linear combinations have
maximum correlation. (For example, if there are only two variables $X_1$ and $X_2$
represented as quadratic polynomials, solve for $a, b, c, d$ such that $aX_1 + bX_1^2$
has maximum correlation with $cX_2 + dX_2^2$.) The process is repeated until the
transformations  converge.  The  goal  of  MGV  is  to  transform  each  variable  so 
that  it  is  most  similar  to  predictions  from  the  other  transformed  variables. 
MGV  does  not  use  PCs  (so  one  need  not  precede  the  analysis  by  variable 
clustering),  but  once  all  variables  have  been  transformed,  you  may  want  to 
summarize  them  with  the  first  PC. 
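
The two-variable quadratic example in the parenthetical above can be illustrated with the cancor function from R's stats package. This is a sketch on simulated data (the relationship between x1 and x2 is an assumption chosen for illustration) showing only a single canonical variates step, not the full iterative MGV algorithm.

```r
set.seed(3)
n  <- 500
x1 <- runif(n, -2, 2)
x2 <- x1^2 + rnorm(n, sd = 0.3)     # x2 is related to x1 only through x1^2

# Find a, b, c, d maximizing cor(a*x1 + b*x1^2, c*x2 + d*x2^2)
cc <- cancor(cbind(x1, x1sq = x1^2), cbind(x2, x2sq = x2^2))

cc$xcoef[, 1]     # (a, b): weights for x1 and x1^2
cc$ycoef[, 1]     # (c, d): weights for x2 and x2^2
c(raw = cor(x1, x2), canonical = cc$cor[1])
```

The raw correlation is near zero because the relationship is non-monotonic, whereas the first canonical correlation is high once each variable is allowed a quadratic transformation.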

The SAS PRINQUAL procedure of Kuhfeld [368] implements the MTV and MGV
methods, and allows for very flexible transformations of the predictors,
including monotonic splines and ordinary cubic splines.

A  very  flexible  automatic  procedure  for  transforming  each  predictor  in 
turn,  based  on  all  remaining  predictors,  is  the  ACE  (alternating  conditional 
expectation) procedure of Breiman and Friedman [68]. Like SAS PROC
PRINQUAL, ACE handles monotonically restricted transformations and categorical
variables. It fits transformations by maximizing $R^2$ between one variable and
a  set  of  variables.  It  automatically  transforms  all  variables,  using  the  “super 
smoother”20'  for  continuous  variables.  Unfortunately,  ACE  does  not  handle 
missing  values.  See  Chapter  16  for  more  about  ACE. 

It  must  be  noted  that  at  best  these  automatic  transformation  procedures 
generally  find  only  marginal  transformations,  not  transformations  of  each  pre¬ 
dictor  adjusted  for  the  effects  of  all  other  predictors.  When  adjusted  transfor¬ 
mations  differ  markedly  from  marginal  transformations,  only  joint  modeling 
of  all  predictors  (and  the  response)  will  find  the  correct  transformations. 

Once transformations are estimated using only predictor information, the
adequacy of each predictor’s transformation can be checked by graphical
methods, by nonparametric smooths of transformed $X_j$ versus $Y$, or by
expanding the transformed $X_j$ using a spline function. This approach of
checking that transformations are optimal with respect to $Y$ uses the response
data, but it accepts the initial transformations unless they are significantly
inadequate. If the sample size is low, or if $PC_1$ for the group of variables used
in deriving the transformations is deemed an adequate summary of those
variables, that $PC_1$ can be used in modeling. In that way, data reduction is
accomplished  two  ways:  by  not  using  Y  to  estimate  multiple  coefficients  for 
a  single  predictor,  and  by  reducing  related  variables  into  a  single  score,  after 
transforming  them.  See  Chapter  8  for  a  detailed  example  of  these  scaling 
techniques. 


4.7.4 Simultaneous Transformation and Imputation

As mentioned in Chapter 3 (p. 52), if transformations are complex or
nonmonotonic, ordinary imputation models may not work. SAS PROC PRINQUAL
implemented a method for simultaneously imputing missing values while
solving for transformations. Unfortunately, the imputation procedure frequently
converges  to  imputed  values  that  are  outside  the  allowable  range  of  the  data. 
This  problem  is  more  likely  when  multiple  variables  are  missing  on  the  same 
subjects,  since  the  transformation  algorithm  may  simply  separate  missings 
and  nonmissings  into  clusters. 

A simple modification of the MGV algorithm of PRINQUAL that
simultaneously imputes missing values without these problems is implemented in
the R function transcan. Imputed values are initialized to medians of
continuous variables and the most frequent category of categorical variables. For
continuous variables, transformations are initialized to linear functions. For
categorical ones, transformations may be initialized to the identity function,
to dummy variables indicating whether the observation has the most
prevalent categorical value, or to random numbers. Then when using canonical
variates  to  transform  each  variable  in  turn,  observations  that  are  missing  on 
the  current  “dependent”  variable  are  excluded  from  consideration,  although 
missing  values  for  the  current  set  of  “predictors”  are  imputed.  Transformed 
variables  are  normalized  to  have  mean  0  and  standard  deviation  1.  Although 
categorical  variables  are  scored  using  the  first  canonical  variate,  transcan  has 
an option to use recursive partitioning to obtain imputed values on the
original scale (Section 2.5) for these variables. It defaults to imputing categorical
variables  using  the  category  whose  predicted  canonical  score  is  closest  to  the 
predicted  score. 

transcan uses restricted cubic splines to model continuous variables. It does
not implement monotonicity constraints. transcan automatically constrains
imputed values (both on transformed and original scales) to be in the same
range as non-imputed ones. This adds much stability to the resulting
estimates although it can result in a boundary effect. Also, imputed values can
optionally be shrunken using Eq. 4.5 to avoid overfitting when developing
the imputation models. Optionally, missing values can be set to specified
constants rather than estimating them. These constants are ignored during
the transformation-estimation phase (see footnote 11). This technique has proved
to be helpful when, for example, a laboratory test is not ordered because a
physician thinks the patient has returned to normal with respect to the lab
parameter measured by the test. In that case, it’s better to use a normal lab
value for missings.

The  transformation  and  imputation  information  created  by  transcan  may 
be  used  to  transform/impute  variables  in  datasets  not  used  to  develop  the 
transformation  and  imputation  formulas.  There  is  also  an  R  function  to  create 
R  functions  that  compute  the  final  transformed  values  of  each  predictor  given 
input  values  on  the  original  scale. 

As an example of non-monotonic transformation and imputation, consider
a sample of 1000 hospitalized patients from the SUPPORT study [352] (see
footnote o). Two mean arterial blood pressure measurements were set to missing.

require(Hmisc)
getHdata(support)   # Get data frame from web site
heart.rate     <- support$hrt
blood.pressure <- support$meanbp
blood.pressure[400:401]

Mean Arterial Blood Pressure Day 3
[1] 151 136

blood.pressure[400:401] <- NA   # Create two missings
d <- data.frame(heart.rate, blood.pressure)
par(pch=46)   # Figure 4.3
w <- transcan(~ heart.rate + blood.pressure, transformed=TRUE,
              imputed=TRUE, show.na=TRUE, data=d)


11 If one were to estimate transformations without removing observations that had
these constants inserted for the current Y-variable, the resulting transformations
would likely have a spike at Y = imputation constant.

o Study to Understand Prognoses Preferences Outcomes and Risks of Treatments

w$imputed$blood.pressure

     400      401
132.4057 109.7741

t   <- w$transformed
spe <- round(c(spearman(heart.rate, blood.pressure),
               spearman(t[,'heart.rate'],
                        t[,'blood.pressure'])), 2)


Fig. 4.3 Transformations fitted using transcan. Tick marks indicate the two imputed
values for blood pressure.


plot(heart.rate, blood.pressure)   # Figure 4.4
plot(t[,'heart.rate'], t[,'blood.pressure'],
     xlab='Transformed hr', ylab='Transformed bp')

Spearman’s rank correlation $\rho$ between pairs of heart rate and blood pressure
was -0.02, because these variables each require U-shaped transformations.
Using restricted cubic splines with five knots placed at default quantiles,
transcan provided the transformations shown in Figure 4.3. Correlation between
transformed variables is $\rho = -0.13$. The fitted transformations are similar to
those obtained from relating these two variables to time until death.


4.7.5 Simple Scoring of Variable Clusters

If  a  subset  of  the  predictors  is  a  series  of  related  dichotomous  variables,  a 
simpler  data  reduction  strategy  is  sometimes  employed.  First,  construct  two 

Fig. 4.4 The lower left plot contains raw data (Spearman $\rho = -0.02$); the lower right
is a scatterplot of the corresponding transformed values ($\rho = -0.13$). Data courtesy
of the SUPPORT study [352].


new  predictors  representing  whether  any  of  the  factors  is  positive  and  a  count 
of  the  number  of  positive  factors.  For  the  ordinal  count  of  the  number  of 
positive  factors,  score  the  summary  variable  to  satisfy  linearity  assumptions 
as  discussed  previously.  For  the  more  powerful  predictor  of  the  two  summary 
measures,  test  for  adequacy  of  scoring  by  using  all  dichotomous  variables  as 
candidate  predictors  after  adjusting  for  the  new  summary  variable.  A  residual 
$\chi^2$ statistic can be used to test whether the summary variable adequately
captures the predictive information of the series of binary predictors (see
footnote p). This
statistic  will  have  degrees  of  freedom  equal  to  one  less  than  the  number  of 
binary  predictors  when  testing  for  adequacy  of  the  summary  count  (and  hence 
will  have  low  power  when  there  are  many  predictors).  Stratification  by  the 
summary  score  and  examination  of  responses  over  cells  can  be  used  to  suggest 
a  transformation  on  the  score. 
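
The two summary predictors and the adequacy check can be sketched as follows. The condition names (chf, mi, angina), the simulated response, and the sample size are hypothetical. The check uses a likelihood ratio test from nested logistic fits as a stand-in for the residual $\chi^2$ statistic described above, so it should be read as an approximation to that procedure.

```r
set.seed(4)
n <- 300
# Hypothetical series of related dichotomous conditions
conds <- data.frame(chf    = rbinom(n, 1, 0.20),
                    mi     = rbinom(n, 1, 0.30),
                    angina = rbinom(n, 1, 0.25))

n_pos   <- rowSums(conds)            # count of positive factors
any_pos <- as.integer(n_pos > 0)     # whether any factor is positive

# Simulated binary response that depends on the count only
y <- rbinom(n, 1, plogis(-1 + 0.8 * n_pos))

fit_count <- glm(y ~ n_pos, family = binomial)
fit_all   <- glm(y ~ chf + mi + angina, family = binomial)  # nests fit_count

# d.f. is one less than the number of binary predictors (here 3 - 1 = 2)
a <- anova(fit_count, fit_all, test = "Chisq")
a
```

A small P-value would suggest that the count does not capture all of the predictive information in the individual conditions.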

Another  approach  to  scoring  a  series  of  related  dichotomous  predictors  is  to 
have  “experts”  assign  severity  points  to  each  condition  and  then  to  either  sum 
these  points  or  use  a  hierarchical  rule  that  scores  according  to  the  condition 
with  the  highest  points  (see  Section  14.3  for  an  example).  The  latter  has  the 
advantage  of  being  easy  to  implement  for  field  use.  The  adequacy  of  either 
type of scoring can be checked using tests of linearity in a regression model
(see footnote q).


p  Whether  this  statistic  should  be  used  to  change  the  model  is  problematic  in  view 
of  model  uncertainty. 

q  The  R  function  score. binary  in  the  Hmisc  package  (see  Section  6.2)  assists  in 
computing  a  summary  variable  from  the  series  of  binary  conditions. 

4.7.6 Simplifying Cluster Scores


If  a  variable  cluster  contains  many  individual  predictors,  parsimony  may 
sometimes  be  achieved  by  predicting  the  cluster  score  from  a  subset  of  its 
components  (using  linear  regression  or  CART  (Section  2.5),  for  example). 
Then  a  new  cluster  score  is  created  and  the  response  model  is  rerun  with  the 
new  score  in  the  place  of  the  original  one.  If  one  constituent  variable  has  a 
very high $R^2$ in predicting the original cluster score, the single variable may
sometimes  be  substituted  for  the  cluster  score  in  refitting  the  model  without 
loss  of  predictive  discrimination. 

Sometimes  it  may  be  desired  to  simplify  a  variable  cluster  by  asking  the 
question  “which  variables  in  the  cluster  are  really  the  predictive  ones?,”  even 
though this approach will usually cause true predictive discrimination to
suffer. For clusters that are retained after limited step-down modeling, the entire
list  of  variables  can  be  used  as  candidate  predictors  and  the  step-down  process 
repeated.  All  variables  contained  in  clusters  that  were  not  selected  initially  are 
ignored.  A  fair  way  to  validate  such  two-stage  models  is  to  use  a  resampling 
method  (Section  5.3)  with  scores  for  deleted  clusters  as  candidate  variables 
for  each  resample,  along  with  all  the  individual  variables  in  the  clusters  the 
analyst  really  wants  to  retain.  A  method  called  battery  reduction  can  be  used 
to  delete  variables  from  clusters  by  determining  if  a  subset  of  the  variables 
can explain most of the variance explained by $PC_1$ (see [142, Chapter 12]
and [445]). This approach does not require examination of associations with Y.
Battery  reduction  can  also  be  used  to  find  a  set  of  individual  variables  that 
capture  much  of  the  information  in  the  first  k  principal  components. 

4.7.7 How Much Data Reduction Is Necessary?

In  addition  to  using  the  sample  size  to  degrees  of  freedom  ratio  as  a  rough 
guide  to  how  much  data  reduction  to  do  before  model  fitting,  the  heuristic 
shrinkage  estimate  in  Equation  4.3  can  also  be  informative.  First,  fit  a  full 
model with all candidate variables, nonlinear terms, and hypothesized
interactions. Let $p$ denote the number of parameters in this model, aside from any
intercepts. Let LR denote the log likelihood ratio $\chi^2$ for this full model. The
estimated shrinkage is $(LR - p)/LR$. If this falls below 0.9, for example, we
may  be  concerned  with  the  lack  of  calibration  the  model  may  experience  on 
new  data.  Either  a  shrunken  estimator  or  data  reduction  is  needed.  A  reduced 
model  may  have  acceptable  calibration  if  associations  with  Y  are  not  used  to 
reduce  the  predictors. 

A  simple  method,  with  an  assumption,  can  be  used  to  estimate  the  target 
number  of  total  regression  degrees  of  freedom  q  in  the  model.  In  a  “best 
case,”  the  variables  removed  to  arrive  at  the  reduced  model  would  have  no 
association with Y. The expected value of the $\chi^2$ statistic for testing those
variables would then be $p - q$. The shrinkage for the reduced model is then
on average $[LR - (p - q) - q]/[LR - (p - q)]$. Setting this ratio to be $\geq 0.9$
and solving for $q$ gives $q \leq (LR - p)/9$. Therefore, reduction of dimensionality
down to $q$ degrees of freedom would be expected to achieve $< 10\%$ shrinkage.
With these assumptions, there is no hope that a reduced model would have
acceptable calibration unless $LR > p + 9$. If the information explained by the
omitted variables is less than one would expect by chance (e.g., their total
$\chi^2$ is extremely small), a reduced model could still be beneficial, as long as
the conservative bound $(LR - q)/LR \geq 0.9$ or $q \leq LR/10$ were achieved. This
conservative bound assumes that no $\chi^2$ is lost by the reduction, that is, that
the final model $\chi^2 \approx LR$. This is unlikely in practice. Had the $p - q$ omitted
variables had a larger $\chi^2$ of $2(p - q)$ (the break-even point for AIC), $q$ must
be $\leq (LR - 2p)/8$.
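
For readers who want the intermediate algebra, the two bounds on $q$ can be obtained as follows. This is only a restatement of the argument above, with the 0.9 target taken as given.

```latex
% Best case: omitted variables contribute their expected chi-square, p - q
\[
\frac{LR-(p-q)-q}{LR-(p-q)} = \frac{LR-p}{LR-p+q} \ge 0.9
\iff 0.1\,(LR-p) \ge 0.9\,q
\iff q \le \frac{LR-p}{9}.
\]
% AIC break-even case: omitted variables contribute 2(p - q)
\[
\frac{LR-2(p-q)-q}{LR-2(p-q)} \ge 0.9
\iff LR-2p+2q \ge 10\,q
\iff q \le \frac{LR-2p}{8}.
\]
```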

As  an  example,  suppose  that  a  binary  logistic  model  is  being  developed 
from  a  sample  containing  45  events  on  150  subjects.  The  10:1  rule  suggests 
we  can  analyze  4.5  degrees  of  freedom.  The  analyst  wishes  to  analyze  age, 
sex,  and  10  other  variables.  It  is  not  known  whether  interaction  between  age 
and  sex  exists,  and  whether  age  is  linear.  A  restricted  cubic  spline  is  fitted 
with  four  knots,  and  a  linear  interaction  is  allowed  between  age  and  sex. 
These  two  variables  then  need  3  +  1  +  1  =  5  degrees  of  freedom.  The  other 
10  variables  are  assumed  to  be  linear  and  to  not  interact  with  themselves 
or  age  and  sex.  There  is  a  total  of  15  d.f.  The  full  model  with  15  d.f.  has 
$LR = 50$. Expected shrinkage from this model is $(50 - 15)/50 = 0.7$. Since
$LR > 15 + 9 = 24$, some reduction might yield a better validating model.
Reduction to $q = (50 - 15)/9 \approx 4$ d.f. would be necessary, assuming the
reduced LR is about $50 - (15 - 4) = 39$. In this case the 10:1 rule yields
about  the  same  value  for  q.  The  analyst  may  be  forced  to  assume  that  age  is 
linear,  modeling  3  d.f.  for  age  and  sex.  The  other  10  variables  would  have  to 
be  reduced  to  a  single  variable  using  principal  components  or  another  scaling 
technique. The AIC-based calculation yields a maximum of 2.5 d.f.
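
The arithmetic in this example can be reproduced with two small helper functions. The function names shrinkage and target_df and the argument chisq_per_df are hypothetical, introduced only for this sketch; chisq_per_df is the $\chi^2$ assumed to be lost per omitted degree of freedom (1 in the best case, 2 at the AIC break-even point).

```r
# Heuristic shrinkage estimate for a model with likelihood ratio
# chi-square LR and p parameters
shrinkage <- function(LR, p) (LR - p) / LR

# Largest q for which the expected shrinkage of the reduced model is
# at least `target`, assuming the p - q omitted d.f. carry
# chisq_per_df * (p - q) units of chi-square
target_df <- function(LR, p, target = 0.9, chisq_per_df = 1) {
  (1 - target) * (LR - chisq_per_df * p) /
    (1 - chisq_per_df * (1 - target))
}

LR <- 50; p <- 15
shrinkage(LR, p)                      # (50 - 15)/50
target_df(LR, p)                      # (50 - 15)/9
target_df(LR, p, chisq_per_df = 2)    # (50 - 2*15)/8
events <- 45
events / 10                           # d.f. suggested by the 10:1 rule
```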

If  the  goal  of  the  analysis  is  to  make  a  series  of  hypothesis  tests  (adjusting 
P-values  for  multiple  comparisons)  instead  of  to  predict  future  responses,  the 
full  model  would  have  to  be  used. 

A  summary  of  the  various  data  reduction  methods  is  given  in  Figure  4.5. 

When  principal  component  analysis  or  related  methods  are  used  for  data 
reduction,  the  model  may  be  harder  to  describe  since  internal  coefficients  are 
“hidden.”  R  code  on  p.  141  shows  how  an  ordinary  linear  model  fit  can  be 
used  in  conjunction  with  a  logistic  model  fit  based  on  principal  components 
to  draw  a  nomogram  with  axes  for  all  predictors. 

Fig. 4.5 Summary of Some Data Reduction Methods

Goal: Group predictors so that each group represents a single dimension that can
be summarized with a single score
Reasons:
• ↓ d.f. arising from multiple predictors
• Make $PC_1$ more reasonable summary
Methods: Variable clustering
• Subject matter knowledge
• Group predictors to maximize proportion of variance explained by $PC_1$ of
each group
• Hierarchical clustering using a matrix of similarity measures between
predictors

Goal: Transform predictors
Reasons:
• ↓ d.f. due to nonlinear and dummy variable components
• Allows predictors to be optimally combined
• Make $PC_1$ more reasonable summary
• Use in customized model for imputing missing values on each predictor
Methods:
• Maximum total variance on a group of related predictors
• Canonical variates on the total set of predictors

Goal: Score a group of predictors
Reasons: ↓ d.f. for group to unity
Methods:
• $PC_1$
• Simple point scores

Goal: Multiple dimensional scoring of all predictors
Reasons: ↓ d.f. for all predictors combined
Methods: Principal components $1, 2, \ldots, k$, $k < p$, computed from all
transformed predictors

4.8  Other  Approaches  to  Predictive  Modeling 

The  approaches  recommended  in  this  text  are 

• fitting fully pre-specified models without deletion of “insignificant”
predictors

• using data reduction methods (masked to Y) to reduce the dimensionality
of the predictors and then fitting the number of parameters the data’s
information content can support

• using shrinkage (penalized estimation) to fit a large model without
worrying about the sample size.

Data reduction approaches covered in the last section can yield very
interpretable, stable models, but there are many decisions to be made when using a
two-stage  (reduction/model  fitting)  approach.  Newer  single  stage  approaches 
are  evolving.  These  new  approaches,  listed  on  the  text’s  web  site,  handle 
continuous  predictors  well,  unlike  recursive  partitioning. 

When data reduction is not required, generalized additive models [277,674]
should  also  be  considered. 


4.9  Overly  Influential  Observations 

Every  observation  should  influence  the  fit  of  a  regression  model.  It  can  be 
disheartening,  however,  if  a  significant  treatment  effect  or  the  shape  of  a 
regression effect rests on one or two observations. Overly influential
observations also lead to increased variance of predicted values, especially when
variances are estimated by bootstrapping after taking variable selection into
account. In some cases, overly influential observations can cause one to
abandon a model, “change” the data, or get more data. Observations can be overly
influential  for  several  major  reasons. 

1. The most common reason is having too few observations for the
complexity of the model being fitted. Remedies for this have been discussed in
Sections 4.7 and 4.3.

2.  Data  transcription  or  data  entry  errors  can  ruin  a  model  fit. 

3.  Extreme  values  of  the  predictor  variables  can  have  a  great  impact,  even 
when  these  values  are  validated  for  accuracy.  Sometimes  the  analyst  may 
deem  a  subject  so  atypical  of  other  subjects  in  the  study  that  deletion 
of  the  case  is  warranted.  On  other  occasions,  it  is  beneficial  to  truncate 
measurements  where  the  data  density  ends.  In  one  dataset  of  4000  patients 
and  2000  deaths,  white  blood  count  (WBC)  ranged  from  500  to  100,000 
with  .05  and  .95  quantiles  of  2755  and  26,700,  respectively.  Predictions 
from  a  linear  spline  function  of  WBC  were  sensitive  to  WBC  >  60,000,  for 
which  there  were  16  patients.  There  were  46  patients  with  WBC  >  40,000. 
Predictions  were  found  to  be  more  stable  when  WBC  was  truncated  at 
40,000,  that  is,  setting  WBC  to  40,000  if  WBC  >  40,000. 

4. Observations containing disagreements between the predictors and the
response can influence the fit. Such disagreements should not lead to
discarding the observations unless the predictor or response values are erroneous
as in Reason 3, or the analysis is made conditional on observations being
unlike the influential ones. In one example a single extreme predictor value
in a sample of size 8000 that was not on a straight line relationship with
the other (X, Y) pairs caused a $\chi^2$ of 36 for testing nonlinearity of the
predictor. Remember that an imperfectly fitting model is a fact of life, and
discarding  the  observations  can  inflate  the  model’s  predictive  accuracy.  On 
rare  occasions,  such  lack  of  fit  may  lead  the  analyst  to  make  changes  in 
the  model’s  structure,  but  ordinarily  this  is  best  done  from  the  “ground 
up”  using  formal  tests  of  lack  of  fit  (e.g.,  a  test  of  linearity  or  interaction). 
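
The truncation described under Reason 3 amounts to a single call to pmin. The sketch below uses simulated white blood counts, so the distribution and the resulting counts are assumptions and will not match the dataset described in the text; only the 40,000 cutoff is taken from it.

```r
set.seed(7)
# Hypothetical right-skewed white blood counts
wbc <- round(exp(rnorm(4000, mean = log(9000), sd = 0.8)))

# Truncate where the data density ends: set WBC to 40,000 if WBC > 40,000
wbc_trunc <- pmin(wbc, 40000)

c(n_above   = sum(wbc > 40000),
  max_raw   = max(wbc),
  max_trunc = max(wbc_trunc))
```

The truncated variable, not the raw one, would then be expanded with a spline in the model, so that predictions no longer hinge on a handful of extreme values.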

Influential  observations  of  the  second  and  third  kinds  can  often  be  detected 
by  careful  quality  control  of  the  data.  Statistical  measures  can  also  be  helpful. 
The  most  common  measures  that  apply  to  a  variety  of  regression  models  are 
leverage, DFBETAS, DFFIT, and DFFITS.

Leverage  measures  the  capacity  of  an  observation  to  be  influential  due 
to  having  extreme  predictor  values.  Such  an  observation  is  not  necessarily 
influential.  To  compute  leverage  in  ordinary  least  squares,  we  define  the  hat 
matrix  H  given  by 

$$H = X(X'X)^{-1}X'. \tag{4.6}$$

$H$ is the matrix that when multiplied by the response vector gives the
predicted values, so it measures how an observation estimates its own predicted
response. The diagonals $h_{ii}$ of $H$ are the leverage measures and they are not
influenced by $Y$. It has been suggested [47] that $h_{ii} > 2(p+1)/n$ signal a high
leverage point, where $p$ is the number of columns in the design matrix $X$
aside from the intercept and $n$ is the number of observations. Some believe
that the distribution of $h_{ii}$ should be examined for values that are higher
than typical.
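
As a small illustration, leverage can be computed directly from Equation 4.6 and compared with R's built-in hatvalues function. The data are simulated and the planted extreme value in observation 1 is an assumption of the example.

```r
set.seed(5)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
x1[1] <- 8                          # one extreme predictor value
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)              # design matrix including the intercept
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix, Eq. 4.6
h <- diag(H)                        # leverages h_ii

p <- ncol(X) - 1                    # columns aside from the intercept
high <- which(h > 2 * (p + 1) / n)  # suggested cutoff for high leverage
high
all.equal(unname(h), unname(hatvalues(fit)))
```

Note that y plays no role in computing h, which is why a high-leverage observation is not necessarily influential.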

DFBETAS is the change in the vector of regression coefficient estimates
upon deletion of each observation in turn, scaled by their standard errors [47].
Since DFBETAS encompasses an effect for each predictor’s coefficient,
DFBETAS allows the analyst to isolate the problem better than some of the
other measures. DFFIT is the change in the predicted $X\beta$ when the
observation is dropped, and DFFITS is DFFIT standardized by the standard error
of the estimate of $X\beta$. In both cases, the standard error used for
normalization is recomputed each time an observation is omitted. Some classify an
observation as overly influential when $|\mathrm{DFFITS}| > 2\sqrt{(p+1)/(n-p-1)}$,
while others prefer to examine the entire distribution of DFFITS to identify
“outliers” [47].
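
For ordinary least squares fits, R's stats package provides dfbetas and dffits. The sketch below applies the DFFITS cutoff quoted above to simulated data in which observation 1 has been deliberately made extreme and discordant; that setup is an assumption for illustration.

```r
set.seed(8)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
x1[1] <- 6; y[1] <- -10             # extreme x that disagrees with the trend
fit <- lm(y ~ x1 + x2)

p   <- length(coef(fit)) - 1        # parameters aside from the intercept
dfb <- dfbetas(fit)                 # n x (p + 1) matrix, one column per coefficient
dff <- dffits(fit)                  # one value per observation

cutoff  <- 2 * sqrt((p + 1) / (n - p - 1))
flagged <- which(abs(dff) > cutoff)
flagged
round(dfb[1, ], 2)                  # which coefficients observation 1 moves
```

Looking at the row of dfb for a flagged observation shows which coefficient is affected, which is the diagnostic advantage of DFBETAS mentioned above.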

Section  10.7  discusses  influence  measures  for  the  logistic  model,  which 
requires  maximum  likelihood  estimation.  These  measures  require  the  use  of 
special  residuals  and  information  matrices  (in  place  of  X'X). 

If  truly  influential  observations  are  identified  using  these  indexes,  careful 
thought is needed to decide how (or whether) to deal with them. Most
important, there is no substitute for careful examination of the dataset before
doing any analyses. Spence and Garrison [581, p. 16] feel that

Although the identification of aberrations receives considerable attention in
most modern statistical courses, the emphasis sometimes seems to be on
disposing of embarrassing data by searching for sources of technical error or
minimizing the influence of inconvenient data by the application of resistant
methods. Working scientists often find the most interesting aspect of the
analysis inheres in the lack of fit rather than the fit itself.


4.10  Comparing  Two  Models 

Frequently one wants to choose between two competing models on the
basis of a common set of observations. The methods that follow assume that
the performance of the models is evaluated on a sample not used to develop
either one. In this case, predicted values from the model can usually be
considered as a single new variable for comparison with responses in the new
dataset. The methods listed below will also work if the models are
compared using the same set of data used to fit each one, as long as both models
have the same effective number of (candidate or actual) parameters. This
requirement prevents us from rewarding a model just because it overfits the
training sample (see Section 9.8.1 for a method comparing two models of
differing complexity). The methods can also be enhanced using bootstrapping
or cross-validation on a single sample to get a fair comparison when the
playing field is not level, for example, when one model had more opportunity for
fitting or overfitting the responses.

Some  of  the  criteria  for  choosing  one  model  over  the  other  are 

1.  calibration  (e.g.,  one  model  is  well-calibrated  and  the  other  is  not), 

2.  discrimination, 

3.  face  validity, 

4.  measurement  errors  in  required  predictors, 

5. use of continuous predictors (which are usually better defined than
categorical ones),

6.  omission  of  “insignificant”  variables  that  nonetheless  make  sense  as  risk 
factors, 

7. simplicity (although this is less important with the availability of
computers), and

8.  lack  of  fit  for  specific  types  of  subjects. 

Items 3 through 7 require subjective judgment, so we focus on the other
aspects. If the purpose of the models is only to rank-order subjects,
calibration is not an issue. Otherwise, a model having poor calibration can be
dismissed outright. Given that the two models have similar calibration,
discrimination should be examined critically. Various statistical indexes can
quantify discrimination ability (e.g., $R^2$, model $\chi^2$, Somers’
$D_{xy}$, Spearman’s $\rho$, area under ROC curve; see Section 10.8). Rank
measures ($D_{xy}$, $\rho$, ROC area) only measure how well predicted values
can rank-order responses. For example, predicted probabilities of 0.01 and
0.99 for a pair of subjects are no better than probabilities of 0.2 and 0.8
using rank measures, if the first subject had a lower response value than the
second. Therefore, rank measures such as ROC area ($c$ index), although fine
for describing a given model, may not be very sensitive in choosing between
two models [118, 488, 493]. This is especially true when the models are
strong, as it is easier to move a rank correlation from 0.6 to 0.7 than it is
to move it from 0.9 to 1.0. Measures such as $R^2$ and the model $\chi^2$
statistic (calculated from the predicted and observed responses) are more
sensitive. Still, one may not know how to interpret the added utility of a
model that boosts the $R^2$ from 0.80 to 0.81.
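
The 0.01/0.99 versus 0.2/0.8 example can be checked numerically. The sketch
below is only an illustration: the function names are hypothetical, and the
Bernoulli log-likelihood is used as a stand-in for the likelihood-based
measures (such as the model $\chi^2$) mentioned above.

```python
from itertools import combinations
from math import log


def c_index(pred, y):
    """Fraction of subject pairs with different outcomes whose predictions
    are ordered the same way as the outcomes (tied predictions count 1/2)."""
    concordant = 0.0
    usable_pairs = 0
    for (p_i, y_i), (p_j, y_j) in combinations(zip(pred, y), 2):
        if y_i == y_j:
            continue  # pairs with equal outcomes carry no ranking information
        usable_pairs += 1
        higher, lower = (p_i, p_j) if y_i > y_j else (p_j, p_i)
        if higher > lower:
            concordant += 1.0
        elif higher == lower:
            concordant += 0.5
    return concordant / usable_pairs


def log_likelihood(pred, y):
    """Bernoulli log-likelihood of the observed 0/1 outcomes."""
    return sum(log(p) if o == 1 else log(1.0 - p) for p, o in zip(pred, y))


y = [0, 1]              # the first subject has the lower response
extreme = [0.01, 0.99]  # predictions that go far out on a limb
modest = [0.20, 0.80]   # same ordering, much less separation

c_extreme, c_modest = c_index(extreme, y), c_index(modest, y)
ll_extreme, ll_modest = log_likelihood(extreme, y), log_likelihood(modest, y)
print(c_extreme, c_modest)    # the rank measure cannot tell the two apart
print(ll_extreme, ll_modest)  # the log-likelihood can
```

Both sets of predictions are perfectly concordant, so any rank measure treats
them as equivalent, whereas the log-likelihood rewards the more extreme (and
here correct) predictions.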

Again given that both models are equally well calibrated, discrimination
can be studied more simply by examining the distribution of predicted values
$\hat{Y}$. Suppose that the predicted value is the probability that a subject dies.
Then  high-resolution  histograms  of  the  predicted  risk  distributions  for  the 
two  models  can  be  very  revealing.  If  one  model  assigns  0.02  of  the  sample  to 
a  risk  of  dying  above  0.9  while  the  other  model  assigns  0.08  of  the  sample  to 
the  high  risk  group,  the  second  model  is  more  discriminating.  The  worth  of  a 
model  can  be  judged  by  how  far  it  goes  out  on  a  limb  while  still  maintaining 
good  calibration. 
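
A minimal sketch of this comparison follows, assuming two vectors of
predicted risks from equally well-calibrated models. The data are made up to
reproduce the 0.02 versus 0.08 example, and the helper names are hypothetical.

```python
def fraction_above(pred, threshold):
    """Proportion of subjects whose predicted risk exceeds the threshold."""
    return sum(p > threshold for p in pred) / len(pred)


def risk_histogram(pred, n_bins=20):
    """Counts of predicted risks in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pred:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


# Hypothetical predicted risks for the same 100 subjects under two models.
model_a = [0.30 + 0.004 * i for i in range(98)] + [0.95, 0.97]
model_b = [0.05 + 0.009 * i for i in range(92)] + [0.91 + 0.01 * i for i in range(8)]

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, fraction_above(pred, 0.9), risk_histogram(pred))
```

Here model B places 0.08 of the sample above a risk of 0.9 against 0.02 for
model A. That spread counts in favor of B only if the calibration of the two
models is comparable.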

Frequently, one model will have a similar discrimination index to another
model, but the likelihood ratio $\chi^2$ statistic is meaningfully greater for
one. Assuming corrections have been made for complexity, the model with the
higher $\chi^2$ usually has a better fit for some subjects, although not
necessarily for the average subject. A crude plot of predictions from the
first model against predictions from the second, possibly stratified by Y, can
help describe the differences in the models. More specific analyses will
determine the characteristics of subjects where the differences are greatest.
Large differences may be caused by an omitted, underweighted, or improperly
transformed predictor, among other reasons. In one example, two models for
predicting hospital mortality in critically ill patients had the same
discrimination index (to two decimal places). For the relatively small subset
of patients with extremely low white blood counts or serum albumin, the model
that treated these factors as continuous variables provided predictions that
were very much different from a model that did not.

When  comparing  predictions  for  two  models  that  may  not  be  calibrated 
(from  overfitting,  e.g.),  the  two  sets  of  predictions  may  be  shrunk  so  as  to 
not  give  credit  for  overfitting  (see  Equation  4.3). 
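
As a rough sketch of such shrinkage, suppose Equation 4.3 is the heuristic
shrinkage estimate (model $\chi^2$ minus the number of parameters, divided by
the model $\chi^2$); that form is an assumption here, since the equation is
not reproduced in this section. Shrinking the linear predictor toward its mean
is likewise a simplification, and the function names are hypothetical.

```python
def heuristic_shrinkage(model_chi2, p):
    """Assumed form of Equation 4.3: (model chi-square - p) / model chi-square."""
    return (model_chi2 - p) / model_chi2


def shrink_linear_predictor(lp, gamma):
    """Pull each linear-predictor value toward the sample mean by factor gamma."""
    center = sum(lp) / len(lp)
    return [center + gamma * (v - center) for v in lp]


# Two hypothetical models: B gains little chi-square for 16 extra parameters.
gamma_a = heuristic_shrinkage(model_chi2=50.0, p=10)
gamma_b = heuristic_shrinkage(model_chi2=52.0, p=26)

lp = [-2.0, 0.0, 2.0]  # the same raw linear predictors, for illustration
shrunk_a = shrink_linear_predictor(lp, gamma_a)
shrunk_b = shrink_linear_predictor(lp, gamma_b)
print(gamma_a, shrunk_a)
print(gamma_b, shrunk_b)
```

After shrinkage, the more heavily parameterized model has a visibly narrower
range of predictions, so it receives less credit for extreme predictions that
may reflect overfitting.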

Sometimes one wishes to compare two models that used the response variable
differently, a much more difficult problem. For example, an investigator
may  want  to  choose  between  a  survival  model  that  used  time  as  a  continuous 
variable,  and  a  binary  logistic  model  for  dead/alive  at  six  months.  Here,  other 
considerations  are  also  important  (see  Section  17.1).  A  model  that  predicts 
dead/alive  at  six  months  does  not  use  the  response  variable  effectively,  and 
it  provides  no  information  on  the  chance  of  dying  within  three  months. 

When one or both of the models is fitted using least squares, it is useful
to compare them using an error measure that was not used as the optimization
criterion, such as mean absolute error or median absolute error. Mean and
median absolute errors are excellent measures for judging the value of a
model developed without transforming the response relative to a model fitted
after transforming Y, then back-transforming to get predictions.
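
The comparison can be sketched as follows. The simulated data, the
single-predictor fits, and the helper names are all illustrative assumptions,
and the errors shown are apparent (in-sample) errors.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y on a single predictor x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_abs_error(obs, pred):
    return statistics.mean([abs(o - p) for o, p in zip(obs, pred)])


def median_abs_error(obs, pred):
    return statistics.median([abs(o - p) for o, p in zip(obs, pred)])


rng = random.Random(42)
x = [rng.uniform(0.0, 3.0) for _ in range(200)]
# Hypothetical right-skewed response that is linear in x on the log scale.
y = [math.exp(0.5 + 0.8 * xi + rng.gauss(0.0, 0.6)) for xi in x]

# Model 1: least squares on the untransformed response.
a_raw, b_raw = fit_line(x, y)
pred_raw = [a_raw + b_raw * xi for xi in x]

# Model 2: least squares on log(Y), then back-transform the predictions.
a_log, b_log = fit_line(x, [math.log(yi) for yi in y])
pred_log = [math.exp(a_log + b_log * xi) for xi in x]

for name, pred in [("untransformed", pred_raw), ("log, back-transformed", pred_log)]:
    print(name, round(mean_abs_error(y, pred), 3), round(median_abs_error(y, pred), 3))
```

Neither model was fitted by minimizing absolute error on the original scale,
so neither has a built-in advantage on these two measures.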


4.11  Improving  the  Practice  of  Multivariable  Prediction 

Standards for published predictive modeling and feature selection in
high-dimensional problems are not very high. There are several things that a
good analyst can do to improve the situation.

1.  Insist  on  validation  of  predictive  models  and  discoveries,  using  rigorous 
internal  validation  based  on  resampling  or  using  external  validation. 

2. Show collaborators that split-sample validation is not appropriate unless
the number of subjects is huge.

• This can be demonstrated by splitting the data more than once and
seeing volatile results, and by calculating a confidence interval for the
predictive accuracy in the test dataset and showing that it is very wide.

3. Run a simulation study with no real associations and show that
associations are easy to find if a dangerous data mining procedure is used.
Alternatively, analyze the collaborator’s data after randomly permuting the
Y vector and show some “positive” findings.

4.  Show  that  alternative  explanations  are  easy  to  posit.  For  example: 

• The importance of a risk factor may disappear if 5 “unimportant” risk
factors are added back to the model.

•  Omitted  main  effects  can  explain  away  apparent  interactions. 

• Perform a uniqueness analysis: attempt to predict the predicted values
from a model derived by data torture from all of the features not
used in the model. If one can obtain $R^2 = 0.85$ in predicting the
“winning” feature signature (predicted values) from the “losing” features, the
“winning” pattern is not unique and may be unreliable.
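
The simulation in item 3 might be sketched as below. The sample size, the
number of candidate features, and the univariable screening rule are
illustrative assumptions, not a recommended procedure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
n_subjects, n_features = 50, 200

# Response and features are generated independently: no real associations exist.
y = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
features = [[rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
            for _ in range(n_features)]

# |r| needed for two-sided P < 0.05 with n = 50, from t(0.975, 48) of about 2.011.
t_crit = 2.011
r_crit = t_crit / math.sqrt(t_crit ** 2 + n_subjects - 2)

correlations = [pearson(f, y) for f in features]
hits = [j for j, r in enumerate(correlations) if abs(r) > r_crit]
best = max(correlations, key=abs)
print(len(hits), "of", n_features, "noise features pass the screen;",
      "largest |r| =", round(abs(best), 2))
```

Roughly 5% of pure-noise features are expected to pass such a screen, and the
strongest of them can look like a convincing discovery when reported alone.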


4.12  Summary:  Possible  Modeling  Strategies 

Some  possible  global  modeling  strategies  are  to 

• Use a method known not to work well (e.g., stepwise variable selection
without penalization; recursive partitioning resulting in a single tree),
document how poorly the model performs (e.g., using the bootstrap), and use
the model anyway

• Develop a black box model that performs poorly and is difficult to interpret
(e.g., does not incorporate penalization)

• Develop a black box model that performs well and is difficult to interpret

•  Develop  interpretable  approximations  to  the  black  box 

• Develop an interpretable model (e.g., give priority to additive effects) that
performs  well  and  is  likely  to  perform  equally  well  on  future  data  from  the 
same  stream. 

As  stated  in  the  Preface,  the  strategy  emphasized  in  this  text,  stemming 
from  the  last  philosophy,  is  to  decide  how  many  degrees  of  freedom  can  be 
“spent,”  where  they  should  be  spent,  and  then  to  spend  them.  If  statistical 
tests  or  confidence  limits  are  required,  later  reconsideration  of  how  d.f.  are 
spent  is  not  usually  recommended.  In  what  follows  some  default  strategies 
are  elaborated.  These  strategies  are  far  from  failsafe,  but  they  should  allow 
the  reader  to  develop  a  strategy  that  is  tailored  to  a  particular  problem.  At 
the  least  these  default  strategies  are  concrete  enough  to  be  criticized  so  that 
statisticians  can  devise  better  ones. 


4.12.1 Developing Predictive Models

The following strategy is generic although it is aimed principally at the
development of accurate predictive models.

1. Assemble as much accurate pertinent data as possible, with wide
distributions for predictor values. For survival time data, follow-up must be
sufficient to capture enough events as well as the clinically meaningful
phases if dealing with a chronic process.

2. Formulate good hypotheses that lead to specification of relevant
candidate predictors and possible interactions. Don’t use Y (either informally
using graphs, descriptive statistics, or tables, or formally using hypothesis
tests or estimates of effects such as odds ratios) in devising the list of
candidate predictors.

3. If there are missing Y values on a small fraction of the subjects but Y
can be reliably substituted by a surrogate response, use the surrogate to
replace the missing values. Characterize tendencies for Y to be missing
using, for example, recursive partitioning or binary logistic regression.
Depending on the model used, even the information on X for observations
with missing Y can be used to improve precision of $\hat{\beta}$, so multiple
imputation of Y can sometimes be effective. Otherwise, discard observations
having missing Y.

4. Impute missing Xs if the fraction of observations with any missing Xs is
not tiny. Characterize observations that had to be discarded. Special
imputation models may be needed if a continuous X needs a non-monotonic
transformation (p. 52). These models can simultaneously impute missing
values while determining transformations. In most cases, multiply impute
missing Xs based on other Xs and Y, and other available information
about the missing data mechanism.

5. For each predictor specify the complexity or degree of nonlinearity that
should be allowed (see Section 4.1). When prior knowledge does not
indicate that a predictor has a linear effect on the property $C(Y|X)$ (the
property of the response that can be linearly related to X), specify the
number of degrees of freedom that should be devoted to the predictor.
The d.f. (or number of knots) can be larger when the predictor is thought
to be more important in predicting Y or when the sample size is large.

6.  If  the  number  of  terms  fitted  or  tested  in  the  modeling  process  (counting 
nonlinear  and  cross-product  terms)  is  too  large  in  comparison  with  the 
number  of  outcomes  in  the  sample,  use  data  reduction  (ignoring  Y)  until 
the  number  of  remaining  free  variables  needing  regression  coefficients  is 
tolerable.  Use  the  m/10  or  m/15  rule  or  an  estimate  of  likely  shrinkage 
or  overfitting  (Section  4.7)  as  a  guide.  Transformations  determined  from 
the  previous  step  may  be  used  to  reduce  each  predictor  into  1  d.f.,  or  the 
transformed  variables  may  be  clustered  into  highly  correlated  groups  if 
more  data  reduction  is  required.  Alternatively,  use  penalized  estimation 
with  the  entire  set  of  variables.  This  will  also  effectively  reduce  the  total 
degrees  of  freedom.2'2 

7.  Use  the  entire  sample  in  the  model  development  as  data  are  too  precious 
to  waste.  If  steps  listed  below  are  too  difficult  to  repeat  for  each  bootstrap 
or  cross-validation  sample,  hold  out  test  data  from  all  model  development 
steps  that  follow. 

8.  When  you  can  test  for  model  complexity  in  a  very  structured  way,  you 
may  be  able  to  simplify  the  model  without  a  great  need  to  penalize  the 
final  model  for  having  made  this  initial  look.  For  example,  it  can  be 
advisable  to  test  an  entire  group  of  variables  (e.g.,  those  more  expensive 
to  collect)  and  to  either  delete  or  retain  the  entire  group  for  further 
modeling, based on a single P-value (especially if the P-value is not
between 0.05 and 0.2). Another example of structured testing to simplify
the “initial” model is making all continuous predictors have the same
number of knots k, varying k from 0 (linear), 3, 4, 5, ..., and choosing
the value of k that optimizes AIC. A composite test of all nonlinear effects
in  a  model  can  also  be  used,  and  statistical  inferences  are  not  invalidated 
if  the  global  test  of  nonlinearity  yields  P  >  0.2  or  so  and  the  analyst 
deletes  all  nonlinear  terms. 

9.  Make  tests  of  linearity  of  effects  in  the  model  only  to  demonstrate  to 
others  that  such  effects  are  often  statistically  significant.  Don’t  remove 
insignificant  effects  from  the  model  when  tested  separately  by  predictor. 
Any  examination  of  the  response  that  might  result  in  simplifying  the 
model  needs  to  be  accounted  for  in  computing  confidence  limits  and  other 
statistics.  It  is  preferable  to  retain  the  complexity  that  was  prespecified 
in  Step  5  regardless  of  the  results  of  assessments  of  nonlinearity. 

10. Check additivity assumptions by testing prespecified interaction terms.
If the global test for additivity is significant or equivocal, all prespecified
interactions should be retained in the model. If the test is decisive (e.g.,
P > 0.3), all interaction terms can be omitted, and in all likelihood there
is  no  need  to  repeat  this  pooled  test  for  each  resample  during  model 
validation.  In  other  words,  one  can  assume  that  had  the  global  interaction 
test  been  carried  out  for  each  bootstrap  resample  it  would  have  been 
insignificant  at  the  0.05  level  more  than,  say,  0.9  of  the  time.  In  this  large 
P-value  case  the  pooled  interaction  test  did  not  induce  an  uncertainty  in 
model  selection  that  needed  accounting. 

11.  Check  to  see  if  there  are  overly  influential  observations. 

12.  Check  distributional  assumptions  and  choose  a  different  model  if  needed. 

13. Do limited backwards step-down variable selection if parsimony is more
important than accuracy [582]. The cost of doing any aggressive variable
selection  is  that  the  variable  selection  algorithm  must  also  be  included 
in  a  resampling  procedure  to  properly  validate  the  model  or  to  compute 
confidence  limits  and  the  like. 

14.  This  is  the  “final”  model. 

15.  Interpret  the  model  graphically  (Section  5.1)  and  by  examining  predicted 
values  and  using  appropriate  significance  tests  without  trying  to  interpret 
some  of  the  individual  model  parameters.  For  collinear  predictors  obtain 
pooled  tests  of  association  so  that  competition  among  variables  will  not 
give  misleading  impressions  of  their  total  significance. 

16. Validate the final model for calibration and discrimination ability,
preferably using bootstrapping (see Section 5.3). Steps 9 to 13 must be repeated
for each bootstrap sample, at least approximately. For example, if age was
transformed when building the final model, and the transformation was
suggested by the data using a fit involving age and age$^2$, each bootstrap
repetition  should  include  both  age  variables  with  a  possible  step-down 
from  the  quadratic  to  the  linear  model  based  on  automatic  significance 
testing  at  each  step. 

17.  Shrink  parameter  estimates  if  there  is  overfitting  but  no  further  data 
reduction  is  desired,  if  shrinkage  was  not  built  into  the  estimation  process. 

18. When missing values were imputed, adjust the final variance-covariance
matrix for imputation wherever possible (e.g., using bootstrap or multiple
imputation). This may affect some of the other results.

19. When all steps of the modeling strategy can be automated, consider
using Faraway’s method [186] to penalize for the randomness inherent in
the multiple steps.

20.  Develop  simplifications  to  the  full  model  by  approximating  it  to  any 
desired  degrees  of  accuracy  (Section  5.5). 
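
The requirement in Step 16 that data-dependent steps be repeated inside each
bootstrap sample can be sketched as follows. The "strategy" here is a
deliberately crude one (pick the single candidate predictor that fits best,
then fit a line), the data are simulated, and all function names are
hypothetical; the optimism estimate is the usual difference between
performance in the bootstrap sample and performance of the same fit in the
original sample.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y on a single predictor x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, pred):
    y_bar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, pred))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - sse / sst


def predict(model, columns):
    j, a, b = model
    return [a + b * v for v in columns[j]]


def develop(columns, y):
    """The whole strategy: screen candidates against y, then keep the winner."""
    fits = [(j,) + fit_line(col, y) for j, col in enumerate(columns)]
    return max(fits, key=lambda m: r_squared(y, predict(m, columns)))


def optimism_corrected_r2(columns, y, n_boot, rng):
    n = len(y)
    apparent = r_squared(y, predict(develop(columns, y), columns))
    total_optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        cols_b = [[col[i] for i in idx] for col in columns]
        y_b = [y[i] for i in idx]
        model_b = develop(cols_b, y_b)  # variable selection is redone here
        total_optimism += (r_squared(y_b, predict(model_b, cols_b))
                           - r_squared(y, predict(model_b, columns)))
    optimism = total_optimism / n_boot
    return apparent, optimism, apparent - optimism


rng = random.Random(7)
n, n_candidates = 40, 10
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_candidates)]
y = [0.5 * columns[0][i] + rng.gauss(0.0, 1.0) for i in range(n)]
apparent, optimism, corrected = optimism_corrected_r2(columns, y, 200, rng)
print(round(apparent, 3), round(optimism, 3), round(corrected, 3))
```

Had the winning predictor been chosen once on the full sample and only the
final line refitted in each resample, the optimism estimate would ignore the
selection step and be too small.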


4.12.2 Developing Models for Effect Estimation

By  effect  estimation  is  meant  point  and  interval  estimation  of  differences  in 
properties  of  the  responses  between  two  or  more  settings  of  some  predictors,  or 
estimating  some  function  of  these  differences  such  as  the  antilog.  In  ordinary 
multiple regression with no transformation of Y, such differences are absolute
estimates. In regression involving log(Y) or in logistic or proportional hazards
models,  effect  estimation  is,  at  least  initially,  concerned  with  estimation  of 
relative  effects.  As  discussed  on  pp.  4  and  224,  estimation  of  absolute  effects 
for  these  models  must  involve  accurate  prediction  of  overall  response  values, 
so  the  strategy  in  the  previous  section  applies. 

When estimating differences or relative effects, the bias in the effect
estimate, besides being influenced by the study design, is related to how well
subject heterogeneity and confounding are taken into account. The variance
of the effect estimate is related to the distribution of the variable whose levels
are being compared, and, in least squares estimates, to the amount of
variation “explained” by the entire set of predictors. Variance of the estimated
difference can increase if there is overfitting. So for estimation, the previous
strategy largely applies.
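
For ordinary least squares these statements can be made concrete with the
standard variance formula for a single coefficient. The notation below is
introduced here for illustration and is not taken from the text: $s_{x_j}^2$
is the sample variance of the predictor of interest, $R_j^2$ is the $R^2$ from
regressing that predictor on the other predictors, $R^2$ is that of the full
model, and $p$ is the number of regression parameters excluding the intercept.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\, s_{x_j}^2\, (1 - R_j^2)},
\qquad
\hat{\sigma}^2
  = \frac{(1 - R^2) \sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - p - 1},
\qquad
\operatorname{Var}\!\bigl[(b - a)\,\hat{\beta}_j\bigr]
  = (b - a)^2 \operatorname{Var}(\hat{\beta}_j).
```

A wider distribution of $x_j$ (larger $s_{x_j}^2$) and a larger model $R^2$
both lower the variance of the estimated difference between settings $a$ and
$b$, while adding parameters that explain little raises $p$ (and possibly
$R_j^2$) without a compensating gain in $R^2$.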

The following are differences in the modeling strategy when effect
estimation is the goal.

1. There is even less gain from having a parsimonious model than when
developing overall predictive models, as estimation is usually done at the
time of analysis. Leaving insignificant predictors in the model increases
the likelihood that the confidence interval for the effect of interest has the
stated coverage. By contrast, overall predictions are conditional on the
values of all predictors in the model. The variance of such predictions is
increased by the presence of unimportant variables, as predictions are still
conditional on the particular values of these variables (Section 5.5.1) and
cancellation of terms (which occurs when differences are of interest) does
not occur.

2. Careful thought about whether to include interactions is still a major
consideration for estimation. If a predictor whose effects are of major interest
is allowed to interact with one or more other predictors, effect estimates
must be conditional on the values of the other predictors and hence have
higher variance.

3. A major goal of imputation is to avoid lowering the sample size because
of missing values in adjustment variables. If the predictor of interest is the
only variable having a substantial number of missing values, multiple
imputation is less worthwhile, unless it corrects for a substantial bias caused
by deletion of nonrandomly missing data.

4. The analyst need not be very concerned about conserving degrees of
freedom devoted to the predictor of interest. The complexity allowed for this
variable is usually determined by prior beliefs, with compromises that
consider the bias-variance trade-off.

5. If penalized estimation is used, the analyst may wish to not shrink
parameter estimates for the predictor of interest.

6. Model validation is not necessary unless the analyst wishes to use it to
quantify the degree of overfitting.


4.12.3 Developing Models for Hypothesis Testing

A default strategy for developing a multivariable model that is to be used
as a basis for hypothesis testing is almost the same as the strategy used for
estimation.

1. There is little concern for parsimony. A full model fit, including
insignificant variables, will result in more accurate P-values for tests for the
variables of interest.

2. Careful thought about whether to include interactions is still a major
consideration for hypothesis testing. If one or more predictors interacts with a
variable of interest, either separate hypothesis tests are carried out over
the levels of the interacting factors, or a combined “main effect +
interaction” test is performed. For example, a very well-defined test is whether
treatment is effective for any race group.

3. If the predictor of interest is the only variable having a substantial number
of missing values, multiple imputation is less worthwhile. In some cases,
multiple imputation may increase power (e.g., in ordinary multiple
regression one can obtain larger degrees of freedom for error) but in others there
will be little net gain. However, the test can be biased due to exclusion of
nonrandomly missing observations if imputation is not done.

4. As before, the analyst need not be very concerned about conserving degrees
of freedom devoted to the predictor of interest. The degrees of freedom
allowed for this variable are usually determined by prior beliefs, with careful
consideration of the trade-off between bias and power.

5. If penalized estimation is used, the analyst should not shrink parameter
estimates for the predictors being tested.

6. Model validation is not necessary unless the analyst wishes to use it to
quantify the degree of overfitting. This may shed light on whether there is
overadjustment for confounders.

Some good general references that address modeling strategies are
[216, 269, 476, 590].

Even though they used a generalized correlation index for screening variables
and not for transforming them, Hall and Miller [249] present a related idea,
computing the ordinary $R^2$ against a cubic spline transformation of each
potential predictor.

Simulation  studies  are  needed  to  determine  the  effects  of  modifying  the  model 
based  on  assessments  of  “predictor  promise.”  Although  it  is  unlikely  that  this 
strategy  will  result  in  regression  coefficients  that  are  biased  high  in  absolute 
value,  it  may  on  some  occasions  result  in  somewhat  optimistic  standard  errors 
and  a  slight  elevation  in  type  I  error  probability.  Some  simulation  results  may 
be  found  on  the  Web  site.  Initial  promising  findings  for  least  squares  models 
for two uncorrelated predictors indicate that the procedure is conservative in
its estimation of $\sigma^2$ and in preserving type I error.

Verweij and van Houwelingen [640] and Shao [565] describe how cross-validation
can be used in formulating a stopping rule. Luo et al. [430] developed an
approach to tuning forward selection by adding noise to Y.

Roecker [528] compared forward variable selection (FS) and all possible subsets
selection (APS) with full model fits in ordinary least squares. APS had a greater
tendency to select smaller, less accurate models than FS. Neither selection
technique was as accurate as the full model fit unless more than half of the
candidate variables were redundant or unnecessary.

Wiegand [668] showed that it is not very fruitful to try different stepwise
algorithms and then to be comforted by agreements in some of the variables
selected. It is easy for different stepwise methods to agree on the wrong set
of variables. Other results on how variable selection affects inference may be
found in Hurvich and Tsai [316] and Breiman [66, Section 8.1].

Goring et al. [227] presented an interesting analysis of the huge bias caused by
conditioning  analyses  on  statistical  significance  in  a  high-dimensional  genetics 
context. 

Steyerberg et al. [589] have comparisons of smoothly penalized estimators with
the  lasso  and  with  several  stepwise  variable  selection  algorithms. 

See Weiss [656], Faraway [186], and Chatfield [100] for more discussions of the effect of
not  prespecifying  models,  for  example,  dependence  of  point  estimates  of  effects 
on  the  variables  used  for  adjustment. 

Greenland [241] provides an example in which overfitting a logistic model resulted
in  far  too  many  predictors  with  P  <  0.05. 

See Peduzzi et al. [486, 487] for studies of the relationship between “events per
variable” and types I and II error, accuracy of variance estimates, and accuracy
of normal approximations for regression coefficient estimators. Their findings
are consistent with those given in the text (but [644] has a slightly different
take). van der Ploeg et al. [629] did extensive simulations to determine the
events per variable ratio needed to avoid a drop-off (in an independent test
sample) of more than 0.01 in the c-index, for a variety of predictive methods.
They concluded that support vector machines, neural networks, and random
forests needed far more events per variable to achieve freedom from overfitting
than does logistic regression, and that recursive partitioning was not
competitive. Logistic regression required between 20 and 50 events per variable
to avoid overfitting. Different results might have been obtained had the
authors used a proper accuracy score.
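
The events-per-variable arithmetic is simple enough to sketch. The code below
assumes that the limiting sample size m for a binary response is the smaller
of the two outcome counts, which this section does not state; the function
names and the example counts are hypothetical.

```python
def limiting_sample_size(n_events, n_nonevents):
    """Assumed limiting sample size m for a binary response."""
    return min(n_events, n_nonevents)


def parameter_budget(n_events, n_nonevents, per_parameter=15):
    """Candidate parameters affordable under an m/15 (or m/10) style rule."""
    return limiting_sample_size(n_events, n_nonevents) // per_parameter


def events_per_variable(n_events, n_nonevents, n_parameters):
    return limiting_sample_size(n_events, n_nonevents) / n_parameters


# Hypothetical study: 1000 subjects, 120 of whom had the event.
print(parameter_budget(120, 880))       # m/15 rule
print(parameter_budget(120, 880, 10))   # m/10 rule
print(events_per_variable(120, 880, 6))
```

The count should include every parameter that was fitted or tested, including
nonlinear and interaction terms, not just those retained in the final model.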

Copas  [122,  Eq.  8.5]  adds  2  to  the  numerator  of  Equation  4.3  (see  also  [504,631]). 

An excellent discussion about such indexes may be found in
http://r.789695.n4.nabble.com/Adjusted-R-squared-formula-in-lm-td4656857.html
where J. Lucke points out that $R^2$ tends to $p/(n-1)$ when the population
$R^2$ is zero, but $R^2_{\mathrm{adj}}$ converges to zero.
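
A small simulation illustrates the behavior of the two indexes when the
population $R^2$ is zero. It relies on the standard result that, for normal
data with $p$ unrelated predictors, the expected $R^2$ is $p/(n-1)$; the
single-predictor setup, sample size, and function names are illustrative
choices.

```python
import math
import random
import statistics


def r_squared_one_predictor(x, y):
    """R^2 from least squares with one predictor: the squared correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return (sxy / math.sqrt(sxx * syy)) ** 2


def adjusted_r_squared(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


rng = random.Random(1)
n, p, reps = 11, 1, 4000
r2_values = []
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unrelated to x
    r2_values.append(r_squared_one_predictor(x, y))

mean_r2 = statistics.mean(r2_values)
mean_adj = statistics.mean([adjusted_r_squared(v, n, p) for v in r2_values])
print(round(mean_r2, 3), "expected about", p / (n - 1))
print(round(mean_adj, 3), "expected about 0")
```

With $n = 11$ and one predictor, the ordinary $R^2$ averages about 0.1 even
though nothing is being explained, while the adjusted index averages about
zero.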

Efron [173, Eq. 4.23] and van Houwelingen and le Cessie [633] showed that the
average expected optimism in a mean logarithmic quality score for a p-predictor
binary logistic model is p/n. Taylor et al. [600] showed that the ratio of variances
for certain quantities is proportional to the ratio of the number of parameters
in two models. Copas stated that “Shrinkage can be particularly marked when
stepwise fitting is used: the shrinkage is then closer to that expected of the
full regression rather than of the subset regression actually fitted.” [122, 504, 631]
Spiegelhalter [582], in arguing against variable selection, states that better
prediction will often be obtained by fitting all candidate variables in the final
model, shrinking the vector of regression coefficient estimates towards zero.

See  Belsley  [46,  pp.  28-30]  for  some  reservations  about  using  VIF. 

Friedman and Wall [208] discuss and provide graphical devices for explaining
suppression by a predictor not correlated with the response but that is correlated
with another predictor. Adjusting for a suppressor variable will increase the
predictive discrimination of the model. Meinshausen [453] developed a novel
hierarchical approach to gauging the importance of collinear predictors.

For incomplete principal component regression see [101, 119, 120, 142, 144, 320,
325]. See [396, 686] for sparse principal component analysis methods in which
constraints are applied to loadings so that some of them are set to zero. The latter
reference provides a principal component method for binary data. See [246] for
a type of sparse principal component analysis that also encourages loadings
to be similar for a group of highly correlated variables and allows for a type
of variable clustering. See [390] for principal surfaces. Sliced inverse regression
is described in [104, 119, 120, 189, 403, 404]. For material on variable
clustering see [142, 144, 268, 441, 539]. A good general reference on cluster analysis
is [634, Chapter 11]. de Leeuw and Mair in their R homals package [153] have
one of the most general approaches to data reduction related to optimal scaling.
Their approach includes nonlinear principal component analysis among several
other multivariate analyses.

The redundancy analysis described here is related to principal variables [448] but
is faster.

Meinshausen [453] developed a method of testing the importance of competing
(collinear)  variables  using  an  interesting  automatic  clustering  procedure. 

The R ClustOfVar package by Marie Chavent, Vanessa Kuentz, Benoit Liquet,
and Jerome Saracco generalizes variable clustering and explicitly handles a
mixture of quantitative and categorical predictors. It also implements bootstrap
cluster  stability  analysis. 

Principal components are commonly used to summarize a cluster of variables.
Vines [643] developed a method to constrain the principal component coefficients
to be integers without much loss of explained variability.

Jolliffe [324] presented a way to discard some of the variables making up principal
components. Wang and Gehan [649] presented a new method for finding subsets of
predictors that approximate a set of principal components, and surveyed other
methods for simplifying principal components.

See D’Agostino et al. [144] for excellent examples of variable clustering (including
a  two-stage  approach)  and  other  data  reduction  techniques  using  both  statistical 
methods  and  subject-matter  expertise. 

Cook [118] and Pencina et al. [490, 492, 493] present an approach for judging the
added value of new variables that is based on evaluating the extent to which
the new information moves predicted probabilities higher for subjects having
events and lower for subjects not having events. But see [292, 592].

The Hmisc abs.error.pred function computes a variety of accuracy measures
based on absolute errors.

Shen et al.56' developed an “optimal approximation” method to make correct
inferences  after  model  selection. 


4.14  Problems 

Analyze the SUPPORT dataset (getHdata(support)) as directed below to
relate selected variables to total cost of the hospitalization. Make sure this
response variable is utilized in a way that approximately satisfies the
assumptions of normality-based multiple regression so that statistical inferences will
be  accurate.  See  problems  at  the  end  of  Chapters  3  and  7  of  the  text  for  more 
information.  Consider  as  predictors  mean  arterial  blood  pressure,  heart  rate, 
age,  disease  group,  and  coma  score. 

1.  Do  an  analysis  to  understand  interrelationships  among  predictors,  and  find 
optimal  scaling  (transformations)  that  make  the  predictors  better  relate 
to  each  other  (e.g.,  optimize  the  variation  explained  by  the  first  principal 
component). 

2.  Do  a  redundancy  analysis  of  the  predictors,  using  both  a  less  stringent  and 
a  more  stringent  approach  to  assessing  the  redundancy  of  the  multiple-level 
variable  disease  group. 

3.  Do  an  analysis  that  helps  one  determine  how  many  d.f.  to  devote  to  each 
predictor. 

4. Fit a model, assuming the above predictors act additively, but do not
assume linearity for the age and blood pressure effects. Use the truncated
power basis for fitting restricted cubic spline functions with 5 knots.
Estimate the shrinkage coefficient $\gamma$.

5.  Make  appropriate  graphical  diagnostics  for  this  model. 

6. Test linearity in age, linearity in blood pressure, and linearity in heart rate,
and also do a joint test of linearity simultaneously in all three predictors.

7.  Expand  the  model  to  not  assume  additivity  of  age  and  blood  pressure. 
Use  a  tensor  natural  spline  or  an  appropriate  restricted  tensor  spline.  If 
you  run  into  any  numerical  difficulties,  use  4  knots  instead  of  5.  Plot  in  an 
interpretable  fashion  the  estimated  3-D  relationship  between  age,  blood 
pressure,  and  cost  for  a  fixed  disease  group. 

8. Test for additivity of age and blood pressure. Make a joint test for the
overall absence of complexity in the model (linearity and additivity
simultaneously).


Chapter  5 

Describing,  Resampling,  Validating, 
and  Simplifying  the  Model 


5.1  Describing  the  Fitted  Model 

5.1.1  Interpreting  Effects 

Before  addressing  issues  related  to  describing  and  interpreting  the  model 
and  its  coefficients,  one  can  never  apply  too  much  caution  in  attempting  to 
interpret  results  in  a  causal  manner.  Regression  models  are  excellent  tools 
for  estimating  and  inferring  associations  between  an  X  and  Y  given  that  the 
“right”  variables  are  in  the  model.  Any  ability  of  a  model  to  provide  causal 
inference  rests  entirely  on  the  faith  of  the  analyst  in  the  experimental  design, 
completeness  of  the  set  of  variables  that  are  thought  to  measure  confounding 
and  are  used  for  adjustment  when  the  experiment  is  not  randomized,  lack  of 
important  measurement  error,  and  lastly  the  goodness  of  fit  of  the  model. 

The first line of attack in interpreting the results of a multivariable analysis
is to interpret the model’s parameter estimates. For simple linear, additive
models, regression coefficients may be readily interpreted. If there are
interactions or nonlinear terms in the model, however, simple interpretations
are usually impossible. Many programs ignore this problem, routinely printing
such meaningless quantities as the effect of increasing age$^2$ by one day
while holding age constant. A meaningful age change needs to be chosen, and
connections between mathematically related variables must be taken into
account. These problems can be solved by relying on predicted values and
differences between predicted values.

Even when the model contains no nonlinear effects, it is difficult to compare
regression coefficients across predictors having varying scales. Some
analysts like to gauge the relative contributions of different predictors on a
common scale by multiplying regression coefficients by the standard deviations
of the predictors that pertain to them. This does not make sense for
nonnormally distributed predictors (and regression models should not need
to make assumptions about the distributions of predictors). When a predictor
is binary (e.g., sex), the standard deviation makes no sense as a scaling
factor as the scale would depend on the prevalence of the predictor.ᵃ

It is more sensible to estimate the change in $Y$ when $X_j$ is changed by
an amount that is subject-matter relevant. For binary predictors this is a
change from 0 to 1. For many continuous predictors the interquartile range
is a reasonable default choice. If the 0.25 and 0.75 quantiles of $X_j$ are $g$ and
$h$, linearity holds, and the estimated coefficient of $X_j$ is $b$, then
$b \times (h - g)$ is the effect of increasing $X_j$ by $h - g$ units, which is
a span that contains half of the sample values of $X_j$.
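
As a minimal sketch of the interquartile-range effect for a linearly acting predictor, the following uses only the Python standard library. The ages and the slope `b_age` are hypothetical values chosen for illustration, and the quantile definition (`method="inclusive"`) is one of several reasonable choices.

```python
import statistics


def iqr_effect(x, b):
    """Return (g, h, effect): g and h are the 0.25 and 0.75 sample quantiles
    of x, and effect = b * (h - g), which assumes x acts linearly."""
    g, _, h = statistics.quantiles(x, n=4, method="inclusive")
    return g, h, b * (h - g)


# Hypothetical ages (years) and a hypothetical fitted slope per year of age.
ages = [34, 41, 47, 52, 55, 58, 63, 66, 71, 79]
b_age = 0.8

g, h, effect = iqr_effect(ages, b_age)
print(f"quartiles: {g} and {h}; interquartile-range effect: {effect:.2f}")
```

The same function applies to any predictor whose effect is summarized by a single slope; for nonlinear effects the difference in predicted values must be used instead.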

For the more general case of continuous predictors that are monotonically
but not linearly related to $Y$, a useful point summary is the change in $X\beta$
when the variable changes from its 0.25 quantile to its 0.75 quantile. For
models for which $\exp(X\beta)$ is meaningful, antilogging the predicted change in
$X\beta$ results in quantities such as interquartile-range odds and hazard ratios.
When the variable is involved in interactions, these ratios are estimated
separately for various levels of the interacting factors. For categorical
predictors, ordinary effects are computed by comparing each level of the
predictor with a reference level. See Section 10.10 and Chapter 11 for tabular
and graphical examples of this approach.
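
For a nonlinear but monotonic effect, the interquartile-range summary is a difference of two predicted values rather than a slope times a span. The sketch below assumes a hypothetical logistic model whose linear predictor is quadratic in age (the coefficients in `lp_age` are invented for illustration) and antilogs the change to obtain an interquartile-range odds ratio.

```python
import math
import statistics


def iqr_ratio(x, linear_predictor):
    """Change in the linear predictor between the 0.25 and 0.75 quantiles
    of x, and its antilog (an odds or hazard ratio, depending on the model)."""
    g, _, h = statistics.quantiles(x, n=4, method="inclusive")
    change = linear_predictor(h) - linear_predictor(g)
    return change, math.exp(change)


def lp_age(age):
    # Hypothetical fitted log odds, quadratic in age, other predictors held fixed.
    return -6.0 + 0.12 * age - 0.0006 * age ** 2


ages = [34, 41, 47, 52, 55, 58, 63, 66, 71, 79]
change, odds_ratio = iqr_ratio(ages, lp_age)
print(f"change in log odds: {change:.4f}; IQR odds ratio: {odds_ratio:.2f}")
```

Because other predictors are held constant, their contributions cancel in the difference, which is why only the age terms appear in `lp_age`.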

The model can be described using partial effect plots by plotting each $X$
against $X\hat{\beta}$ holding other predictors constant. Modified versions of such
plots, by nonlinearly rank-transforming the predictor axis, can show the
relative importance of a predictor [336].

For  an  X  that  interacts  with  other  factors,  separate  curves  are  drawn  on 
the  same  graph,  one  for  each  level  of  the  interacting  factor. 

Nomograms [40, 254, 339, 427] provide excellent graphical depictions of all the
variables in the model, in addition to enabling the user to obtain predicted
values manually. Nomograms are especially good at helping the user envision
interactions. See Section 10.10 and Chapter 11 for examples.


5.1.2  Indexes  of  Model  Performance 

5.1.2.1 Error Measures

Care  must  be  taken  in  the  choice  of  accuracy  scores  to  be  used  in  validation. 

Indexes  can  be  broken  down  into  three  main  areas. 

Central tendency of prediction errors: These measures include mean absolute
differences, mean squared differences, and logarithmic scores. An absolute
measure is mean $|\hat{Y} - Y|$. The mean squared error is a commonly
used and sensitive measure if there are no outliers. For the special case
where $Y$ is binary, such a measure is the Brier score, which is a quadratic
proper scoring rule that combines calibration and discrimination.ᵇ The
logarithmic proper scoring rule (related to average log-likelihood) is even
more sensitive but can be harder to interpret and can be destroyed by a
single predicted probability of 0 or 1 that was incorrect.

a The s.d. of a binary variable is, aside from a multiplier of $\sqrt{n/(n-1)}$,
equal to $\sqrt{a(1-a)}$, where $a$ is the proportion of ones.

Discrimination measures: A measure of pure discrimination is a rank
correlation of $\hat{Y}$ and $Y$, including Spearman’s $\rho$, Kendall’s $\tau$, and
Somers’ $D_{xy}$. When $Y$ is binary, $D_{xy} = 2 \times (c - \frac{1}{2})$ where $c$ is the
concordance probability or area under the receiver operating characteristic
curve, a linear translation of the Wilcoxon-Mann-Whitney statistic. $R^2$ is
mostly a measure of discrimination, and $R^2_{\mathrm{adj}}$ is a good
overfitting-corrected measure, if the model is pre-specified. See Section 10.8
for more information about rank-based measures.

Discrimination measures based on variation in $\hat{Y}$: These include the
regression sum of squares and the g-index (see below).

Calibration measures: These assess absolute prediction accuracy.
Calibration-in-the-large compares the average $\hat{Y}$ with the average $Y$.
A high-resolution calibration curve or calibration-in-the-small assesses the
absolute forecast accuracy of predictions at individual levels of $\hat{Y}$. When
the calibration curve is linear, this can be summarized by the calibration
slope and intercept. A more general approach uses the loess nonparametric
smoother to estimate the calibration curve [37]. For any shape of calibration
curve, errors can be summarized by quantities such as the maximum absolute
calibration error, mean absolute calibration error, and 0.9 quantile
of calibration error.
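
Several of the indexes just listed can be computed directly for a binary response. The following is a minimal sketch in plain Python, not the rms or Hmisc implementations; the function names and the six predicted probabilities are hypothetical. It shows the Brier score, the logarithmic score (which becomes minus infinity after a single wrong prediction of 0 or 1), the concordance probability $c$, Somers' $D_{xy} = 2(c - 1/2)$, and calibration-in-the-large.

```python
import math


def brier_score(p, y):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)


def log_score(p, y):
    """Average log-likelihood; -inf if any prediction of 0 or 1 was wrong."""
    total = 0.0
    for pi, yi in zip(p, y):
        q = pi if yi == 1 else 1.0 - pi  # probability assigned to what happened
        if q == 0.0:
            return float("-inf")
        total += math.log(q)
    return total / len(y)


def concordance(p, y):
    """Fraction of (event, non-event) pairs in which the event has the higher
    predicted probability, counting ties as one half."""
    events = [pi for pi, yi in zip(p, y) if yi == 1]
    nonevents = [pi for pi, yi in zip(p, y) if yi == 0]
    score = sum((e > ne) + 0.5 * (e == ne) for e in events for ne in nonevents)
    return score / (len(events) * len(nonevents))


def somers_dxy(p, y):
    return 2.0 * (concordance(p, y) - 0.5)


def calibration_in_the_large(p, y):
    """Average prediction minus average outcome."""
    return sum(p) / len(p) - sum(y) / len(y)


# Hypothetical predicted probabilities and observed binary outcomes.
p_hat = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
y_obs = [0, 0, 1, 0, 1, 1]
print("Brier:", round(brier_score(p_hat, y_obs), 4))
print("log score:", round(log_score(p_hat, y_obs), 4))
print("c:", round(concordance(p_hat, y_obs), 4))
print("Dxy:", round(somers_dxy(p_hat, y_obs), 4))
print("calibration-in-the-large:", round(calibration_in_the_large(p_hat, y_obs), 4))
```

The pairwise loop in `concordance` is quadratic in the sample size, which is acceptable for a sketch but not for large datasets.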


The g-index is a new measure of a model’s predictive discrimination based
only on $X\hat{\beta} = \hat{Y}$ that applies quite generally. It is based on Gini’s mean
difference for a variable $Z$, which is the mean over all possible $i \neq j$ of
$|Z_i - Z_j|$. The g-index is an interpretable, robust, and highly efficient
measure of variation. For example, when predicting systolic blood pressure,
$g = 11$ mmHg represents a typical difference in $\hat{Y}$. $g$ is independent of
censoring and other complexities. For models in which the anti-log of a
difference in $\hat{Y}$ represents meaningful ratios (e.g., odds ratios, hazard
ratios, ratio of medians), $g_r$ can be defined as $\exp(g)$. For models in which
$\hat{Y}$ can be turned into a probability estimate (e.g., logistic regression),
$g_p$ is defined as Gini’s mean difference of $\hat{P}$. These g-indexes represent,
e.g., “typical” odds ratios and “typical” risk differences. Partial g indexes
can also be defined. More details may be found in the documentation for the
R rms package’s gIndex function.
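
Gini's mean difference is simple enough to compute by brute force, which makes the three g-indexes easy to illustrate. This is a sketch in plain Python and not the rms implementation; the linear predictor values are hypothetical, and the logistic transformation is assumed only for the probability-scale index.

```python
import math
from itertools import combinations


def gini_mean_difference(z):
    """Mean of |z_i - z_j| over all pairs with i != j."""
    n = len(z)
    total = sum(abs(a - b) for a, b in combinations(z, 2))
    return total / (n * (n - 1) / 2)


def g_indexes(linear_predictor):
    """Return (g, g_r, g_p) for a logistic-type model: g on the linear
    predictor scale, g_r = exp(g), and g_p on the probability scale."""
    g = gini_mean_difference(linear_predictor)
    probs = [1.0 / (1.0 + math.exp(-lp)) for lp in linear_predictor]
    return g, math.exp(g), gini_mean_difference(probs)


# Hypothetical values of the linear predictor (log odds) for five subjects.
lp = [-2.0, -1.0, 0.0, 1.0, 2.0]
g, g_r, g_p = g_indexes(lp)
print(f"g = {g:.3f}, g_r = {g_r:.3f}, g_p = {g_p:.3f}")
```

Averaging over unordered pairs gives the same result as averaging over all ordered pairs with $i \neq j$, since each absolute difference appears twice in the latter.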


b  There  are  decompositions  of  the  Brier  score  into  discrimination  and  calibration 
components. 

5.2 The Bootstrap

When one assumes that a random variable $Y$ has a certain population
distribution, one can use simulation or analytic derivations to study how a
statistical estimator computed from samples from this distribution behaves. For
example, when $Y$ has a log-normal distribution, the variance of the sample
median for a sample of size $n$ from that distribution can be derived
analytically. Alternatively, one can simulate 500 samples of size $n$ from the
log-normal distribution, compute the sample median for each sample, and then
compute the sample variance of the 500 sample medians. Either case requires
knowledge of the population distribution function.
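
The simulation route described above can be sketched in a few lines. The parameters `mu` and `sigma` (mean and standard deviation of $\log Y$), the seed, and the function name are assumptions made for illustration.

```python
import random
import statistics


def simulated_variance_of_median(n, reps=500, mu=0.0, sigma=1.0, seed=1):
    """Simulate `reps` samples of size n from a log-normal distribution and
    return the sample variance of the resulting sample medians."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.lognormvariate(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.variance(medians)


for n in (25, 100, 400):
    print(n, simulated_variance_of_median(n))
```

The variance shrinks roughly in proportion to $1/n$, as large-sample theory for the median suggests. Note that the simulation is only possible because the population distribution was specified, which is exactly the knowledge the bootstrap does without.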

Efron’s bootstrap [150, 177, 178] is a general-purpose technique for obtaining
estimates of the properties of statistical estimators without making assumptions
about the distribution giving rise to the data. Suppose that a random variable
$Y$ comes from a cumulative distribution function $F(y) = \mathrm{Prob}\{Y \le y\}$ and
that we have a sample of size $n$ from this unknown distribution, $Y_1, Y_2, \ldots, Y_n$.
The basic idea is to repeatedly simulate a sample of size $n$ from $F$, computing
the statistic of interest, and assessing how the statistic behaves over $B$
repetitions. Not having $F$ at our disposal, we can estimate $F$ by the empirical
cumulative distribution function


$$F_n(y) = \frac{1}{n} \sum_{i=1}^{n} [Y_i \le y]. \tag{5.1}$$

Fn  corresponds  to  a  density  function  that  places  probability  1/n  at  each 
observed  datapoint  (k/n  if  that  point  were  duplicated  k  times  and  its  value 
listed  only  once). 
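
The empirical cumulative distribution function is the proportion of sample values not exceeding $y$. A minimal sketch follows; the function name `ecdf` is chosen here for illustration, and the six data values are the small example dataset used later in this section.

```python
def ecdf(sample):
    """Return F_n, the empirical cumulative distribution function of `sample`:
    F_n(y) is the fraction of observations less than or equal to y."""
    values = list(sample)
    n = len(values)

    def F_n(y):
        return sum(1 for v in values if v <= y) / n

    return F_n


F_n = ecdf([1, 5, 6, 7, 8, 9])
print(F_n(0), F_n(1), F_n(6), F_n(9))

# A value duplicated k times contributes a jump of k/n at that point.
F_dup = ecdf([2, 2, 5])
print(F_dup(2))
```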

As  an  example,  consider  a  random  sample  of  size  n  =  30  from  a  normal 
distribution  with  mean  100  and  standard  deviation  10.  Figure  5.1  shows  the 
population  and  empirical  cumulative  distribution  functions. 

Now pretend that $F_n(y)$ is the original population distribution $F(y)$.
Sampling from $F_n$ is equivalent to sampling with replacement from the observed
data $Y_1, \ldots, Y_n$. For large $n$, the expected fraction of original datapoints
that are selected for each bootstrap sample is $1 - e^{-1} = 0.632$. Some points
are selected twice, some three times, a few four times, and so on. We take $B$
samples of size $n$ with replacement, with $B$ chosen so that the summary measure
of the individual statistics is nearly as good as taking $B = \infty$. The bootstrap
is based on the fact that the distribution of the observed differences between a
resampled estimate of a parameter of interest and the original estimate of the
parameter from the whole sample tells us about the distribution of unobservable
differences between the original estimate and the unknown population
value of the parameter.
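
The 0.632 figure can be checked empirically: for a sample of size $n$, the chance that a given observation appears at least once in a bootstrap sample is $1 - (1 - 1/n)^n$, which tends to $1 - e^{-1}$. The sketch below draws indices with replacement and averages the fraction of distinct observations selected; the choices of `n`, `B`, and the seed are arbitrary.

```python
import math
import random


def mean_fraction_selected(n, B=200, seed=1):
    """Average, over B bootstrap samples of size n, of the fraction of the
    n original observations that appear at least once."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        picked = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
        total += len(picked) / n
    return total / B


n = 1000
print("simulated:", mean_fraction_selected(n))
print("exact for this n:", 1 - (1 - 1 / n) ** n)
print("large-n limit:", 1 - math.exp(-1))
```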

As an example, consider the data (1, 5, 6, 7, 8, 9) and suppose that we would
like to obtain a 0.80 confidence interval for the population median, as well as
an estimate of the population expected value of the sample median (the latter
is only used to estimate bias in the sample median). The first 20 bootstrap
samples (after sorting data values) and the corresponding sample medians
are shown in Table 5.1.

Fig. 5.1 Empirical and population cumulative distribution function

For  a  given  number  B  of  bootstrap  samples,  our  estimates  are  simply 
the  sample  0.1  and  0.9  quantiles  of  the  sample  medians,  and  the  mean  of 
the  sample  medians.  Not  knowing  how  large  B  should  be,  we  could  let  B 
range  from,  say,  50  to  1000,  stopping  when  we  are  sure  the  estimates  have 
converged.  In  the  left  plot  of  Figure  5.2,  B  varies  from  1  to  400  for  the  mean 
(10  to  400  for  the  quantiles).  It  can  be  seen  that  the  bootstrap  estimate  of  the 
population  mean  of  the  sample  median  can  be  estimated  satisfactorily  when 
B  >  50.  For  the  lower  and  upper  limits  of  the  0.8  confidence  interval  for  the 
population median, B must be at least 200. For more extreme confidence
limits,  B  must  be  higher  still. 

For the final set of 400 sample medians, a histogram (right plot in
Figure 5.2) can be used to assess the form of the sampling distribution of the
sample  median.  Here,  the  distribution  is  almost  normal,  although  there  is  a 
slightly  heavy  left  tail  that  comes  from  the  data  themselves  having  a  heavy  left 
tail.  For  large  samples,  sample  medians  are  normally  distributed  for  a  wide 
variety  of  population  distributions.  Therefore  we  could  use  bootstrapping  to 
estimate  the  variance  of  the  sample  median  and  then  take  ±1.28  standard 
errors  as  a  0.80  confidence  interval.  In  other  cases  (e.g.,  regression  coefficient 
estimates  for  certain  models),  estimates  are  asymmetrically  distributed,  and 
the  bootstrap  quantiles  are  better  estimates  than  confidence  intervals  that 
are based on a normality assumption. Note that because sample quantiles
are more or less restricted to equal one of the values in the sample, the
bootstrap distribution is discrete and can be dependent on a small number of
outliers. For this reason, bootstrapping quantiles does not work particularly
well for small samples [150, pp. 41-43].

Fig. 5.2 Estimating properties of sample median using the bootstrap
(left panel x-axis: Bootstrap Samples Used)

Table 5.1 First 20 bootstrap samples

Bootstrap Sample    Sample Median
1 6 6 7 8 9         6.5
1 5 5 5 6 8         5.0
5 7 8 9 9 9         8.5
7 7 7 8 8 9         7.5
1 5 7 7 9 9         7.0
1 5 6 6 7 8         6.0
7 8 8 8 8 8         8.0
5 5 5 7 9 9         6.0
1 5 5 7 7 9         6.0
1 5 5 7 7 8         6.0
1 1 5 5 7 7         5.0
1 1 5 5 7 8         5.0
1 5 5 7 7 8         6.0
1 5 6 7 8 8         6.5
1 5 6 7 9 9         6.5
6 6 7 7 8 9         7.0
1 5 7 8 8 9         7.5
6 6 8 9 9 9         8.5
1 1 5 5 6 9         5.0
1 6 8 9 9 9         8.5

The method just presented for obtaining a nonparametric confidence
interval for the population median is called the bootstrap percentile method. It
is the simplest but not necessarily the best performing bootstrap method.
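
The percentile method for the small dataset (1, 5, 6, 7, 8, 9) can be sketched as follows. This is a plain-Python illustration, not the code used to produce the figures or table in the text: the random number stream differs, so the individual bootstrap samples will not match Table 5.1, and the linear-interpolation `quantile` helper is one of several possible quantile definitions.

```python
import random
import statistics


def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_percentile_median(data, B=400, level=0.80, seed=1):
    """Bootstrap percentile confidence limits for the median, plus the mean
    of the bootstrap medians (used only to estimate bias)."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample is n draws with replacement from the data.
    medians = sorted(statistics.median(rng.choices(data, k=n)) for _ in range(B))
    alpha = (1.0 - level) / 2.0
    return {
        "lower": quantile(medians, alpha),
        "upper": quantile(medians, 1.0 - alpha),
        "mean": statistics.mean(medians),
        "medians": medians,
    }


data = [1, 5, 6, 7, 8, 9]
result = bootstrap_percentile_median(data)
print("0.80 percentile interval:", result["lower"], "to", result["upper"])
print("estimated bias of sample median:", result["mean"] - statistics.median(data))
```

Rerunning with increasing `B` and watching the limits settle is a simple way to judge whether enough resamples have been taken.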

In this text we use the bootstrap primarily for computing statistical
estimates that are much different from standard errors and confidence intervals,
namely, estimates of model performance.


5.3  Model  Validation 


5.3.1  Introduction 


The  surest  method  to  have  a  model  fit  the  data  at  hand  is  to  discard  much 
of the data. A p-variable fit to p + 1 observations will perfectly predict Y as
long  as  no  two  observations  have  the  same  Y.  Such  a  model  will,  however, 
yield  predictions  that  appear  almost  random  with  respect  to  responses  on 
a  different  dataset.  Therefore,  unbiased  estimates  of  predictive  accuracy  are 
essential. 

Model validation is done to ascertain whether predicted values from the
model are likely to accurately predict responses on future subjects or
subjects not used to develop our model. Three major causes of failure of the
model to validate are overfitting, changes in measurement methods/changes
in definition of categorical variables, and major changes in subject inclusion
criteria.

There are two major modes of model validation, external and internal. The
most stringent external validation involves testing a final model developed in
one country or setting on subjects in another country or setting at another
time. This validation would test whether the data collection instrument was
translated into another language properly, whether cultural differences make
earlier findings nonapplicable, and whether secular trends have changed
associations or base rates. Testing a finished model on new subjects from the
same geographic area but from a different institution than the subjects used to
fit the model is a less stringent form of external validation. The least stringent
form of external validation involves using the first m of n observations for
model training and using the remaining n - m observations as a test sample.
This is very similar to data-splitting (Section 5.3.3). For details about
methods for external validation see the R val.prob and val.surv functions in the
rms package.

Even though external validation is frequently favored by non-statisticians,
it is often problematic. Holding back data from the model-fitting phase
results in lower precision and power, and one can increase precision and learn
more about geographic or time differences by fitting a unified model to the
entire subject series including, for example, country or calendar time as a
main effect and/or as an interacting effect. Indeed one could use the following
working definition of external validation: validation of a prediction tool
using data that were not available when the tool needed to be completed. An
alternate definition could be taken as the validation of a prediction tool by
an independent research team.

One  suggested  hierarchy  of  the  quality  of  various  validation  methods  is  as 
follows,  ordered  from  worst  to  best. 

1.  Attempting  several  validations  (internal  or  external)  and  reporting  only 
the  one  that  “worked” 

2.  Reporting  apparent  performance  on  the  training  dataset  (no  validation) 

3.  Reporting  predictive  accuracy  on  an  undersized  independent  test  sample 

4. Internal validation using data-splitting where at least one of the training
and test samples is not huge and the investigator is not aware of the
arbitrariness of variable selection done on a single sample

5. Strong internal validation using 100 repeats of 10-fold cross-validation or
several hundred bootstrap resamples, repeating all analysis steps involving
Y afresh at each resample and reporting the arbitrariness of selected
“important variables” (if variable selection is used)

6.  External  validation  on  a  large  test  sample,  done  by  the  original  research 
team 

7. Re-analysis by an independent research team using strong internal
validation of the original dataset

8.  External  validation  using  new  test  data,  done  by  an  independent  research 
team 

9. External validation using new test data generated using different
instruments/technology, done by an independent research team

Internal  validation  involves  fitting  and  validating  the  model  by  carefully 
using  one  series  of  subjects.  One  uses  the  combined  dataset  in  this  way  to 
estimate  the  likely  performance  of  the  final  model  on  new  subjects,  which 
after  all  is  often  of  most  interest.  Most  of  the  remainder  of  Section  5.3  deals 
with  internal  validation. 


5.3.2  Which  Quantities  Should  Be  Used 
in  Validation? 

For ordinary multiple regression models, the $R^2$ index is a good measure
of the model’s predictive ability, especially for the purpose of quantifying
drop-off in predictive ability when applying the model to other datasets.
$R^2$ is biased, however. For example, if one used nine predictors to predict
outcomes of 10 subjects, $R^2 = 1.0$ but the $R^2$ that will be achieved on future
subjects will be close to zero. In this case, dramatic overfitting has occurred.
The adjusted $R^2$ (Equation 4.4) solves this problem, at least when the model
has been completely prespecified and no variables or parameters have been
“screened” out of the final model fit. That is, $R^2_{\mathrm{adj}}$ is only valid when $p$
in its formula is honest: when it includes all parameters ever examined (formally
or informally, e.g., using graphs or tables) whether these parameters are in
the final model or not.

Quite often we need to validate indexes other than $R^2$ for which adjustments
for $p$ have not been created.ᶜ We also need to validate models containing
“phantom degrees of freedom” that were screened out earlier, formally
or informally. For these purposes, we obtain nearly unbiased estimates of $R^2$
or other indexes using data splitting, cross-validation, or the bootstrap. The
bootstrap provides the most precise estimates.

The g-index is another discrimination measure to validate. But $g$ and $R^2$
measure only one aspect of predictive ability. In general, there are two major
aspects of predictive accuracy that need to be assessed. As discussed in
Section 4.5, calibration or reliability is the ability of the model to make unbiased
estimates of outcome. Discrimination is the model’s ability to separate
subjects’ outcomes. Validation of the model is recommended even when a data
reduction technique is used. This is a way to ensure that the model was not
overfitted or is otherwise inaccurate.


5.3.3 Data-Splitting

The simplest validation method is one-time data-splitting. Here a dataset is
split into training (model development) and test (model validation) samples
by a random process with or without balancing distributions of the response
and predictor variables in the two samples. In some cases, a chronological
split is used so that the validation is prospective. The model’s calibration
and discrimination are validated in the test set.

c For example, in the binary logistic model, there is a generalization of $R^2$
available, but no adjusted version. For logistic models we often validate other
indexes such as the ROC area or rank correlation between predicted
probabilities and observed outcomes. We also validate the calibration accuracy
of $\hat{Y}$ in predicting $Y$.

In ordinary least squares, calibration may be assessed by, for example,
plotting $Y$ against $\hat{Y}$. Discrimination here is assessed by $R^2$ and it is of
interest to compare $R^2$ in the training sample with that achieved in the
test sample. A drop in $R^2$ indicates overfitting, and the absolute $R^2$ in the
test sample is an unbiased estimate of predictive discrimination. Note that
in extremely overfitted models, $R^2$ in the test set can be negative, since it is
computed on “frozen” intercept and regression coefficients using the formula
$1 - SSE/SST$, where $SSE$ is the error sum of squares, $SST$ is the total sum
of squares, and $SSE$ can be greater than $SST$ (when predictions are worse
than the constant predictor $\bar{Y}$).
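
The test-sample $R^2 = 1 - SSE/SST$ is computed from predictions made with coefficients frozen at their training-sample values, so nothing forces it to be nonnegative. A minimal sketch with hypothetical numbers, where `y_hat` stands for predictions from a model fitted elsewhere:

```python
def r_squared_frozen(y, y_hat):
    """R^2 = 1 - SSE/SST for observed y and predictions y_hat that were
    produced by a model whose coefficients are not re-estimated on y."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


y_test = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions, the constant predictor (mean of y), and predictions
# that are worse than the constant predictor.
print(r_squared_frozen(y_test, [1.0, 2.0, 3.0, 4.0]))
print(r_squared_frozen(y_test, [2.5, 2.5, 2.5, 2.5]))
print(r_squared_frozen(y_test, [4.0, 3.0, 2.0, 1.0]))
```

The third call gives a negative value because its error sum of squares exceeds the total sum of squares in the test sample.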

To be able to validate predictions from the model over an entire test
sample (without validating it separately in particular subsets such as in males
and females), the test sample must be large enough to precisely fit a model
containing one predictor. For a study with a continuous uncensored response
variable, the test sample size should ordinarily be > 100 at a bare minimum.
For survival time studies, the test sample should at least be large enough
to contain a minimum of 100 outcome events. For binary outcomes, the test
sample should contain a bare minimum of 100 subjects in the least frequent
outcome category. Once the size of the test sample is determined, the
remaining portion of the original sample can be used as a training sample. Even
with these test sample sizes, validation of extreme predictions is difficult.

Data-splitting has the advantage of allowing hypothesis tests to be
confirmed in the test sample. However, it has the following disadvantages.

1. Data-splitting greatly reduces the sample size for both model development
and model testing. Because of this, Roecker [528] found this method “appears
to be a costly approach, both in terms of predictive accuracy of the fitted
model and the precision of our estimate of the accuracy.” Breiman [66,
Section 1.3] found that bootstrap validation on the original sample was as
efficient as having a separate test sample twice as large [36].

2. It requires a larger sample to be held out than cross-validation (see
below) to be able to obtain the same precision of the estimate of predictive
accuracy.

3.  The  split  may  be  fortuitous;  if  the  process  were  repeated  with  a  different 
split,  different  assessments  of  predictive  accuracy  may  be  obtained. 

4. Data-splitting does not validate the final model, but rather a model
developed on only a subset of the data. The training and test sets are recombined
for fitting the final model, which is not validated.

5.  Data-splitting  requires  the  split  before  the  first  analysis  of  the  data.  With 
other  methods,  analyses  can  proceed  in  the  usual  way  on  the  complete 
dataset.  Then,  after  a  “final”  model  is  specified,  the  modeling  process  is 
rerun  on  multiple  resamples  from  the  original  data  to  mimic  the  process 
that  produced  the  “final”  model. 


5.3.4  Improvements  on  Data-Splitting:  Resampling 

Bootstrapping, jackknifing, and other resampling plans can be used to obtain
nearly unbiased estimates of model performance without sacrificing sample
size. These methods work when either the model is completely specified
except for the regression coefficients, or all important steps of the modeling
process, especially variable selection, are automated. Only then can each
bootstrap replication be a reflection of all sources of variability in
modeling. Note that most analyses involve examination of graphs and testing for
lack of model fit, with many intermediate decisions by the analyst such as
simplification of interactions. These processes are difficult to automate. But
variable selection alone is often the greatest source of variability because of
multiple comparison problems, so the analyst must go to great lengths to
bootstrap or jackknife variable selection.

The  ability  to  study  the  arbitrariness  of  how  a  stepwise  variable  selection 
algorithm  selects  “important”  factors  is  a  major  benefit  of  bootstrapping.  A 
useful  display  is  a  matrix  of  blanks  and  asterisks,  where  an  asterisk  is  placed 
in  column  x  of  row  i  if  variable  x  is  selected  in  bootstrap  sample  i  (see  p. 
263  for  an  example).  If  many  variables  appear  to  be  selected  at  random, 
the  analyst  may  want  to  turn  to  a  data  reduction  method  rather  than  using 
stepwise  selection  (see  also  [541]). 

Cross-validation is a generalization of data-splitting that solves some of the
problems of data-splitting. Leave-out-one cross-validation [565, 633], the limit of
cross-validation, is similar to jackknifing [675]. Here one observation is omitted
from  the  analytical  process  and  the  response  for  that  observation  is  predicted 
using  a  model  derived  from  the  remaining  n  —  1  observations.  The  process 
is  repeated  n  times  to  obtain  an  average  accuracy.  Efron  2  reports  that 
grouped  cross-validation  is  more  accurate;  here  groups  of  k  observations  are 
omitted at a time. Suppose, for example, that 10 groups are used. The
original dataset is divided into 10 equal subsets at random. The first 9 subsets
are used to develop a model (transformation selection, interaction testing,
stepwise variable selection, etc. are all done). The resulting model is assessed
for accuracy on the remaining 1/10th of the sample. This process is repeated
at least 10 times to get an average of 10 indexes such as $R^2$.
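
Grouped cross-validation can be sketched as follows. To stay self-contained, the "model development" step here is only a one-predictor least squares fit; in a real analysis every data-dependent step (transformation selection, interaction testing, variable selection) would be repeated inside the loop. The data are simulated, and the function names are chosen for illustration.

```python
import random


def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(y, pred):
    my = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst


def make_folds(n, k, seed=1):
    """Randomly divide indices 0..n-1 into k groups of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grouped_cv_r2(x, y, k=10, seed=1):
    """Average over k folds of R^2 computed on the held-out group, using a
    model developed on the other k - 1 groups."""
    scores = []
    for held_out in make_folds(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [a + b * x[i] for i in held_out]
        scores.append(r_squared([y[i] for i in held_out], pred))
    return sum(scores) / k


# Simulated data: a linear signal plus noise.
rng = random.Random(7)
x = [float(i) for i in range(100)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 5) for xi in x]
print("10-fold cross-validated R^2:", grouped_cv_r2(x, y))
```

Calling `grouped_cv_r2` with several different seeds and averaging the results corresponds to repeating the whole 10-fold procedure.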

A drawback of cross-validation is the choice of the number of observations
to hold out from each fit. Another is that the number of repetitions needed to
achieve accurate estimates of accuracy often exceeds 200. For example, one
may have to omit 1/10th of the sample 500 times to accurately estimate the
index of interest. Thus the sample would need to be split into tenths 50 times.
Another possible problem is that cross-validation may not fully represent the
variability of variable selection. If 20 subjects are omitted each time from a
sample of size 1000, the lists of variables selected from each training sample
of size 980 are likely to be much more similar than lists obtained from fitting
independent samples of 1000 subjects. Finally, as with data-splitting,
cross-validation does not validate the full 1000-subject model.

An  interesting  way  to  study  overfitting  could  be  called  the  randomization 
method.  Here  we  ask  the  question  “How  well  can  the  response  be  predicted 
when  we  use  our  best  procedure  on  random  responses  when  the  predictive 
accuracy  should  be  near  zero?”  The  better  the  fit  on  random  Y,  the  worse  the 
overfitting.  The  method  takes  a  random  permutation  of  the  response  variable 
and  develops  a  model  with  optional  variable  selection  based  on  the  original  X 
and permuted Y. Suppose this yields $R^2$ = 0.2 for the fitted sample. Apply the
fit to the original data to estimate optimism. If overfitting is not a problem,
$R^2$ would be the same for both fits and it will ordinarily be very near zero.
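
The first step of the randomization method can be sketched as follows. This is an illustration under stated assumptions: simulated data with 10 predictors and 50 observations, an ordinary `lm()` fit with no variable selection, and 200 permutations. It shows only how large the apparent $R^2$ can be when the response has been deliberately disconnected from the predictors.

```r
# Hypothetical data: n = 50 observations, p = 10 candidate predictors
set.seed(3)
n <- 50
p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))   # columns V1..V10
y <- X$V1 + rnorm(n)

nperm   <- 200
r2_perm <- numeric(nperm)
for (i in seq_len(nperm)) {
  y_perm <- sample(y)                  # permutation breaks any X-Y association
  fit    <- lm(y_perm ~ ., data = X)   # the "best procedure" would go here
  r2_perm[i] <- summary(fit)$r.squared
}

mean_r2_perm <- mean(r2_perm)   # apparent R^2 attributable to overfitting alone
```

With these settings the average should be close to p/(n - 1), about 0.2, even though the true predictive accuracy on permuted responses is zero.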

5.3.5 Validation Using the Bootstrap

Efron,172, 173 Efron and Gong,175 Gong,224 Efron and Tibshirani,177, 178
Linnet,416 and Breiman 3 describe several bootstrapping procedures for
obtaining nearly unbiased estimates of future model performance without holding
back data when making the final estimates of model parameters. With the
“simple bootstrap” [178, p. 247], one repeatedly fits the model in a bootstrap
sample and evaluates the performance of the model on the original sample.
The likely performance of the final model on future data is estimated by the
average of all of the indexes computed on the original sample.

Efron showed that an enhanced bootstrap estimates future model performance
more accurately than the simple bootstrap. Instead of estimating
an accuracy index directly from averaging indexes computed on the original
sample, the enhanced bootstrap uses a slightly more indirect approach by
estimating the bias due to overfitting or the “optimism” in the final model
fit. After the optimism is estimated, it can be subtracted from the index
of accuracy derived from the original sample to obtain a bias-corrected or
overfitting-corrected estimate of predictive accuracy. The bootstrap method
is as follows. From the original X and Y in the sample of size n, draw a
sample with replacement also of size n. Derive a model in the bootstrap sample
and apply it without change to the original sample. The accuracy index
from the bootstrap sample minus the index computed on the original sample
is an estimate of optimism. This process is repeated for 100 or so bootstrap
replications to obtain an average optimism, which is subtracted from the final
model fit’s apparent accuracy to obtain the overfitting-corrected estimate.
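
The steps above can be sketched in base R as follows. The simulated data, the helper `r2()`, and the choice of $R^2$ as the accuracy index are assumptions for illustration. The model here is prespecified, so "deriving a model" reduces to refitting; any data-driven steps would have to be repeated inside the loop.

```r
# Hypothetical data: 8 predictors, 60 observations, 2 real effects
set.seed(4)
n <- 60
p <- 8
d <- data.frame(matrix(rnorm(n * p), n, p))
names(d) <- paste0("x", 1:p)
d$y <- d$x1 + 0.5 * d$x2 + rnorm(n)

# Accuracy index: R^2 of a fitted model evaluated on a given data frame
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

full_fit <- lm(y ~ ., data = d)
apparent <- r2(full_fit, d)            # index on the sample used for fitting

B        <- 100
optimism <- numeric(B)
for (b in 1:B) {
  boot     <- d[sample(n, n, replace = TRUE), ]   # sample of size n, with replacement
  boot_fit <- lm(y ~ ., data = boot)              # derive the model in the bootstrap sample
  # index in bootstrap sample minus index when applied unchanged to the original sample
  optimism[b] <- r2(boot_fit, boot) - r2(boot_fit, d)
}

corrected <- apparent - mean(optimism)  # overfitting-corrected estimate
```

The same loop works for any other index by swapping out `r2()`.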

Note that bootstrapping validates the process that was used to fit the original
model (as does cross-validation). It provides an estimate of the expected
value of the optimism, which when subtracted from the original index, provides
an estimate of the expected bias-corrected index. If stepwise variable
selection is part of the bootstrap process (as it must be if the final model
is developed that way), and not all resamples (samples with replacement or
training samples in cross-validation) resulted in the same model (which is
almost always the case), this internal validation process actually provides an
unbiased estimate of the future performance of the process used to identify
markers and scoring systems; it does not validate a single final model. But
resampling does tend to provide good estimates of the future performance of
the final model that was selected using the same procedure repeated in the
resamples.

Note that by drawing samples from X and Y, we are estimating aspects
of  the  unconditional  distribution  of  statistical  quantities.  One  could  instead 
draw  samples  from  quantities  such  as  residuals  from  the  model  to  obtain  a 
distribution  that  is  conditional  on  X.  However,  this  approach  requires  that 
the  model  be  specified  correctly,  whereas  the  unconditional  bootstrap  does 
not.  Also,  the  unconditional  estimators  are  similar  to  conditional  estimators 
except  for  very  skewed  or  very  small  samples  [186,  p.  217]. 

Bootstrapping can be used to estimate the optimism in virtually any index.
Besides discrimination indexes such as $R^2$, slope and intercept calibration
factors can be estimated. When one fits the model $C(Y|X) = X\beta$, and then
refits the model $C(Y|X) = \gamma_0 + \gamma_1 X\hat{\beta}$ on the same data,
where $\hat{\beta}$ is an estimate of $\beta$, $\hat{\gamma}_0$ and
$\hat{\gamma}_1$ will necessarily be 0 and 1, respectively. However, when
$\hat{\beta}$ is used to predict responses on another dataset, $\hat{\gamma}_1$
may be < 1 if there is overfitting, and $\hat{\gamma}_0$ will be different from
zero to compensate. Thus a bootstrap estimate of $\gamma_1$ will not only
quantify overfitting nicely, but can also be used to shrink predicted values to
make them more calibrated (similar to [582]). Efron’s optimism bootstrap is used
to estimate the optimism in (0, 1) and then $(\gamma_0, \gamma_1)$ are estimated
by subtracting the optimism in the constant estimator (0, 1). Note that in
cross-validation one estimates $\beta$ with $\hat{\beta}$ from the training
sample and fits $C(Y|X) = \gamma X\hat{\beta}$ on the test sample directly. Then
the $\gamma$ estimates are averaged over all test samples. This approach does
not require the choice of a parameter that determines the amount of shrinkage
as does ridge regression or penalized maximum likelihood estimation; instead
one estimates how to make the initial fit well calibrated.123,633 However, this
approach is not as reliable as building shrinkage into the original estimation
process. The latter allows different parameters to be shrunk by different
factors.
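
As a rough illustration of the bootstrap estimate of the calibration slope, the sketch below applies the optimism bootstrap to the slope alone, using ordinary least squares on simulated data. The data, the helper `slope()`, and the number of replications are assumptions for the example; the intercept could be handled the same way.

```r
# Hypothetical data: 8 predictors, 60 observations, 2 real effects
set.seed(5)
n <- 60
p <- 8
d <- data.frame(matrix(rnorm(n * p), n, p))
names(d) <- paste0("x", 1:p)
d$y <- d$x1 + 0.5 * d$x2 + rnorm(n)

# Calibration slope: regress observed Y on the linear predictor from 'fit'
slope <- function(fit, data) {
  lp <- predict(fit, newdata = data)
  unname(coef(lm(data$y ~ lp))[2])
}

full_fit       <- lm(y ~ ., data = d)
apparent_slope <- slope(full_fit, d)   # necessarily 1 on the fitting sample

B   <- 200
opt <- numeric(B)
for (b in 1:B) {
  boot <- d[sample(n, n, replace = TRUE), ]
  bf   <- lm(y ~ ., data = boot)
  # slope is 1 in the bootstrap sample; optimism is 1 minus the slope
  # obtained when the bootstrap fit is applied to the original sample
  opt[b] <- slope(bf, boot) - slope(bf, d)
}

gamma1 <- apparent_slope - mean(opt)   # overfitting-corrected slope
```

A value of `gamma1` noticeably below 1 suggests that predictions are too extreme and could be multiplied by roughly that factor (with a compensating intercept) to improve calibration.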

Ordinary bootstrapping can sometimes yield overly optimistic estimates
of optimism, that is, may underestimate the amount of overfitting. This is
especially true when the ratio of the number of observations to the number
of parameters estimated is not large.205 A variation on the bootstrap that
improves precision of the assessment is the “.632” method, which Efron found
to be optimal in several examples.172 This method provides a bias-corrected
estimate of predictive accuracy by substituting 0.632 × [apparent accuracy
- $\hat{\epsilon}_0$] for the estimate of optimism, where $\hat{\epsilon}_0$ is
a weighted average of accuracies evaluated on observations omitted from
bootstrap samples [178, Eq. 17.25, p. 253].
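
Written out, with $\hat{\epsilon}_0$ denoting the weighted average of accuracies on omitted observations, the substitution amounts to the following. The numerical values in the second display are hypothetical and serve only to show the arithmetic.

```latex
\begin{align*}
\text{optimism}_{.632} &= 0.632\,\bigl(\text{apparent} - \hat{\epsilon}_0\bigr),\\
\text{corrected}_{.632} &= \text{apparent} - 0.632\,\bigl(\text{apparent} - \hat{\epsilon}_0\bigr)
                         = 0.368\,\text{apparent} + 0.632\,\hat{\epsilon}_0 .
\end{align*}
% Hypothetical example: apparent = 0.50, \hat{\epsilon}_0 = 0.40
\begin{align*}
\text{optimism}_{.632} &= 0.632 \times (0.50 - 0.40) = 0.0632,\\
\text{corrected}_{.632} &= 0.50 - 0.0632 = 0.4368 = 0.368 \times 0.50 + 0.632 \times 0.40 .
\end{align*}
```

The corrected estimate is thus a weighted average that leans toward the out-of-sample accuracy.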

For ordinary least squares, where the genuine per-observation .632 estimator
can be used, several simulations revealed close agreement with the modified
.632 estimator, even in small, highly overfitted samples. In these overfitted
cases, the ordinary bootstrap bias-corrected accuracy estimates were
significantly higher than the .632 estimates. Simulations259, 591 have shown,
however, that for most types of indexes of accuracy of binary logistic
regression models, Efron’s original bootstrap has lower mean squared error than
the .632 bootstrap when n = 200, p = 30. Bootstrap overfitting-corrected
estimates of model performance can be biased in favor of the model. Although
cross-validation is less biased than the bootstrap, Efron172 showed that it has
much higher variance in estimating overfitting-corrected predictive accuracy
than bootstrapping. In other words, cross-validation, like data-splitting, can
yield significantly different estimates when the entire validation process is
repeated.

Table 5.2 Example validation with and without variable selection

Method           Apparent Rank Correlation    Over-Optimism    Bias-Corrected
                 of Predicted vs. Observed                     Correlation
Full Model                 0.50                   0.06             0.44
Stepwise Model             0.47                   0.05             0.42

It is frequently very informative to estimate a measure of predictive accuracy
forcing all candidate factors into the fit and then to separately estimate
accuracy allowing stepwise variable selection, possibly with different stopping
rules. Consistent with Spiegelhalter’s proposal to use all factors and
then to shrink the coefficients to adjust for overfitting,582 the full model fit
will outperform the stepwise model more often than not. Even though stepwise
modeling has slightly less optimism in predictive discrimination, this
improvement is not enough to offset the loss of information from deleting
even marginally important variables. Table 5.2 shows a typical scenario. In
this example, stepwise modeling lost a possible 0.50 - 0.47 = 0.03 predictive
discrimination. The full model fit will especially be an improvement when

1. the stepwise selection deletes several variables that are almost significant;

2. these marginal variables have some real predictive value, even if it’s slight;
   and

3. there is no small set of extremely dominant variables that would be easily
   found by stepwise selection.

Faraway186 has a fascinating study showing how resampling methods can
be used to estimate the distributions of predicted values and of effects of a
predictor, adjusting for an automated multistep modeling process. Bootstrapping
can be used, for example, to penalize the variance in predicted values for
choosing a transformation for Y and for outlier and influential observation
deletion, in addition to variable selection. Estimation of the transformation of
Y greatly increased the variance in Faraway’s examples. Brownstone [77, p.
74] states that “In spite of considerable efforts, theoretical statisticians have
been unable to analyze the sampling properties of [usual multistep modeling
strategies] under realistic conditions” and concludes that the modeling strategy
must be completely specified and then bootstrapped to get consistent
estimates of variances and other sampling properties.

5.4 Bootstrapping Ranks of Predictors


When the order of importance of predictors is not pre-specified but the
researcher attempts to determine that order by assessing multiple associations
with Y, the process of selecting “winners” and “losers” is unreliable. The
bootstrap can be used to demonstrate the difficulty of this task, by estimating
confidence intervals for the ranks of all the predictors. Even though the
bootstrap intervals are wide, they actually underestimate the true widths.250

The following example uses simulated data with known ranks of importance
of 12 predictors, using an ordinary linear model. The importance metric
is the partial $\chi^2$ minus its degrees of freedom, while the true metric is the
partial $\beta$, as all covariates have $U(0, 1)$ distributions.


# Use the plot method for anova, with pl=FALSE to suppress
# actual plotting of chi-square - d.f. for each bootstrap
# repetition.  Rank the adjusted chi-squares so that the
# highest rank (12) is assigned to the largest.  It is
# important to tell plot.anova.rms not to sort the results,
# or every bootstrap replication would have ranks of 1,2,3,
# ... for the partial test statistics.
require(rms)
n <- 300
set.seed(1)
d <- data.frame(x1=runif(n), x2=runif(n), x3=runif(n),
                x4=runif(n), x5=runif(n), x6=runif(n),
                x7=runif(n), x8=runif(n), x9=runif(n),
                x10=runif(n), x11=runif(n), x12=runif(n))
d$y <- with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 +
               7*x7 + 8*x8 + 9*x9 + 10*x10 + 11*x11 +
               12*x12 + 9*rnorm(n))

f <- ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
B <- 1000
ranks <- matrix(NA, nrow=B, ncol=12)
# Ranks of chi-square - d.f., in the original predictor order
rankvars <- function(fit)
  rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)
for(i in 1:B) {
  j <- sample(1:n, n, TRUE)               # bootstrap row indexes
  bootfit <- update(f, data=d, subset=j)
  ranks[i,] <- rankvars(bootfit)
}
# Bootstrap percentile 0.95 limits for each predictor's rank
lim <- t(apply(ranks, 2, quantile, probs=c(.025,.975)))
predictor <- factor(names(Rank), names(Rank))
w <- data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
require(ggplot2)
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() +
  coord_flip() + scale_y_continuous(breaks=1:12) +
  geom_errorbar(aes(ymin=lim[,1], ymax=lim[,2]), width=0)


With a sample size of n = 300 the observed ranks of predictor importance do
not coincide with population $\beta$s, even when there are no collinearities among
the predictors. Confidence intervals are wide; for example the 0.95 confidence
interval for the rank of x7 (which has a true rank of 7) is [1, 8], so we are
only confident that x7 is not one of the 4 most influential predictors. The
confidence intervals do include the true ranks in each case (Figure 5.3).

Fig. 5.3 Bootstrap percentile 0.95 confidence limits for ranks of predictors in an OLS
model. Ranking is on the basis of partial $\chi^2$ minus d.f. Point estimates are original
ranks


5.5  Simplifying  the  Final  Model  by  Approximating  It 
5.5.1  Difficulties  Using  Full  Models 

A model that contains all prespecified terms will usually be the one that
predicts the most accurately on new data. It is also a model for which confidence
limits and statistical tests have the claimed properties. Often, however, this
model will not be very parsimonious. The full model may require more predictors
than the researchers care to collect in future samples. It also requires
predicted values to be conditional on all of the predictors, which can increase
the variance of the predictions.

As  an  example  suppose  that  least  squares  has  been  used  to  fit  a  model 
containing  several  variables  including  race  (with  four  categories).  Race  may 
be  an  insignificant  predictor  and  may  explain  a  tiny  fraction  of  the  observed 
variation  in  Y.  Yet  when  predictions  are  requested,  a  value  for  race  must  be 
inserted. If the subject is of the majority race, and this race has a majority of,
say 0.75, the variance of the predicted value will not be significantly greater
than the variance for a predicted value from a model that excluded race
from its list of predictors. If, however, the subject is of a minority race (say
“other” with a prevalence of 0.01), the predicted value will have much higher
variance. One approach to this problem, that does not require development
of a second model, is to ignore the subject’s race and to get a weighted
average prediction. That is, we obtain predictions for each of the four races
and weight these predictions by the relative frequencies of the four races.^d
This  weighted  average  estimates  the  expected  value  of  Y  unconditional  on 
race.  It  has  the  advantage  of  having  exactly  correct  confidence  limits  when 
model  assumptions  are  satisfied,  because  the  correct  “error  term”  is  being 
used  (one  that  deducts  3  d.f.  for  having  ever  estimated  the  race  effect).  In 
regression  models  having  nonlinear  link  functions,  this  process  does  not  yield 
such  a  simple  interpretation. 
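
The weighted average prediction can be sketched in base R for the least squares case. The simulated data, variable names, and race frequencies below are assumptions for illustration, and the calculation is a plain linear contrast rather than any particular package's method.

```r
# Hypothetical data: race has four categories with frequencies 300/56/40/4
set.seed(7)
n    <- 400
race <- factor(rep(c("a", "b", "c", "other"), times = c(300, 56, 40, 4)))
age  <- runif(n, 30, 70)
y    <- 0.05 * age + rnorm(n)          # race has no true effect in this simulation
d    <- data.frame(y, age, race)

fit <- lm(y ~ age + race, data = d)

# One prediction setting per race, at age 50
new   <- data.frame(age = 50,
                    race = factor(levels(d$race), levels = levels(d$race)))
X_new <- model.matrix(~ age + race, data = new)   # one design row per race
w     <- as.numeric(table(d$race)) / n            # relative frequencies
cvec  <- drop(w %*% X_new)                        # weighted average of the rows

est <- sum(cvec * coef(fit))                      # prediction unconditional on race
se  <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))
ci  <- est + c(-1, 1) * qt(0.975, df.residual(fit)) * se

# For comparison: standard error when conditioning on the rare "other" category
se_other <- predict(fit, new[4, ], se.fit = TRUE)$se.fit
```

The error degrees of freedom from `df.residual(fit)` still reflect the 3 d.f. spent on estimating the race effect, which is what keeps the confidence limits honest.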

When predictors are collinear, their competition results in larger P-values
when predictors are (often inappropriately) tested individually. Likewise,
confidence intervals for individual effects will be wide and uninterpretable (can
other variables really be held constant when one is changed?).


5.5.2  Approximating  the  Full  Model 

When the full model contains several predictors that do not appreciably affect
the predictions, the above process of “unconditioning” is unwieldy. In the
search for a simple solution, the most commonly used procedure for making
the model parsimonious is to remove variables on the basis of P-values, but
this results in a variety of problems as we have seen. Our approach instead
is to consider the full model fit as the “gold standard” model, especially the
model from which formal inferences are made. We then proceed to approximate
this full model to any desired degree of accuracy. For any approximate
model we calculate the accuracy with which it approximates the best model.
One goal this process accomplishes is that it provides different degrees of
parsimony to different audiences, based on their needs. One investigator may
be able to collect only three variables, another one seven. Each investigator
will know how much she is giving up by using a subset of the predictors.
In approximating the gold standard model it is very important to note that
there is nothing gained in removing certain nonlinear terms; gains in parsimony
come only from removing entire predictors. Another accomplishment
of model approximation is that when the full model has been fitted using
shrinkage (penalized estimation, Section 9.10), the approximate models will
inherit the shrinkage (see Section 14.10 for an example).

d Using the rms package described in Chapter 6, such estimates and their
confidence limits can easily be obtained, using for example contrast(fit,
list(age=50, disease='hypertension', race=levels(race)), type='average',
weights=table(race)).

Approximating complex models with simpler ones has been used to decode
“black boxes” such as artificial neural networks. Recursive partitioning
trees (Section 2.5) may sometimes be used in this context. One develops a
regression tree to predict the predicted value $X\hat{\beta}$ on the basis of the
unique variables in X, using $R^2$, the average absolute prediction error, or the
maximum absolute prediction error as a stopping rule, for example.184 The user
desiring simplicity may use the tree to obtain predicted values, using the first
k nodes, with k just large enough to yield a low enough absolute error in
predicting the more comprehensive prediction. Overfitting is not a problem as it
is when the tree procedure is used to predict the outcome, because (1) given
the predictor values the predictions are deterministic and (2) the variable
being predicted is a continuous, completely observed variable. Hence the best
cross-validating tree approximation will be one with one subject per node.
One advantage of the tree-approximation procedure is that data collection
on an individual subject whose outcome is being predicted may be abbreviated
by measuring only those Xs that are used in the top nodes, until the
prediction is resolved to within a tolerable error.

When  principal  component  regression  is  being  used,  trees  can  also  be  used 
to  approximate  the  components  or  to  make  them  more  interpretable. 

Full models may also be approximated using least squares as long as the
linear predictor $X\hat{\beta}$ is the target, and not some nonlinear transformation of
it such as a logistic model probability. When the original model was fitted
using unpenalized least squares, submodels fitted against $\hat{Y}$ will have the same
coefficients as if least squares had been used to fit the subset of predictors
directly against Y. To see this, note that if X denotes the entire design matrix
and T denotes a subset of the columns of X, the coefficient estimates for the
full model are $(X'X)^{-1}X'Y$, $\hat{Y} = X(X'X)^{-1}X'Y$, estimates for a reduced
model fitted against Y are $(T'T)^{-1}T'Y$, and coefficients fitted against $\hat{Y}$ are
$(T'T)^{-1}T'X(X'X)^{-1}X'Y$, which can be shown to equal $(T'T)^{-1}T'Y$.

When least squares is used for both the full and reduced models, the
variance-covariance matrix of the coefficient estimates of the reduced model is
$(T'T)^{-1}\sigma^2$, where the residual variance $\sigma^2$ is estimated using the full model.
When $\sigma^2$ is estimated by the unbiased estimator using the d.f. from the
full model, which provides the only unbiased estimate of $\sigma^2$, the estimated
variance-covariance matrix of the reduced model will be appropriate (unlike
that from stepwise variable selection) although the bootstrap may be needed
to fully take into account the source of variation due to how the approximate
model was selected.
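
The equality of the two sets of reduced-model coefficients follows because the columns of T lie in the column space of X, so the projection $H = X(X'X)^{-1}X'$ satisfies $HT = T$ and therefore $T'\hat{Y} = T'HY = T'Y$. The sketch below checks this numerically and forms the reduced-model covariance matrix using the full-model residual variance; the simulated design and the names `T_sub`, `sigma2_full`, and `V_reduced` are assumptions for illustration.

```r
# Hypothetical full design: intercept plus 4 predictors
set.seed(8)
n <- 100
X <- cbind(1, matrix(rnorm(n * 4), n, 4))
y <- drop(X %*% c(1, 2, -1, 0.5, 0)) + rnorm(n)
T_sub <- X[, 1:3]                        # subset of the columns of X

# Full model: coefficients and predicted values
beta_full <- solve(crossprod(X), crossprod(X, y))
y_hat     <- drop(X %*% beta_full)

# Reduced model fitted against Y and against Y-hat
coef_vs_y    <- solve(crossprod(T_sub), crossprod(T_sub, y))
coef_vs_yhat <- solve(crossprod(T_sub), crossprod(T_sub, y_hat))
max_diff     <- max(abs(coef_vs_y - coef_vs_yhat))   # zero up to rounding error

# Variance-covariance matrix of the reduced coefficients,
# using the residual variance and d.f. from the FULL model
sigma2_full <- sum((y - y_hat)^2) / (n - ncol(X))
V_reduced   <- solve(crossprod(T_sub)) * sigma2_full
```

Using `n - ncol(X)` rather than `n - ncol(T_sub)` for the error d.f. is the point of difference from a covariance matrix that pretends the reduced model was prespecified.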

So if in the least squares case the approximate model coefficients are
identical to coefficients obtained upon fitting the reduced model against Y, how
is model approximation any different from stepwise variable selection? There
are several differences, in addition to how $\sigma^2$ is estimated.

1. When the full model is approximated by a backward step-down procedure
   against $\hat{Y}$, the stopping rule is less arbitrary. One stops deleting variables
   when deleting any further variable would make the approximation
   inadequate (e.g., the $R^2$ for predictions from the reduced model against the
   original $\hat{Y}$ drops below 0.95).

2. Because the stopping rule is different (i.e., is not based on P-values), the
   approximate model will have a different number of predictors than an
   ordinary stepwise model.

3. If the original model used penalization, approximate models will inherit
   the amount of shrinkage used in the full fit.

Typically, though, if one performed ordinary backward step-down against Y
using a large cutoff for $\alpha$ (e.g., 0.5), the approximate model would be very
similar to the step-down model. The main difference would be the use of
a larger estimate of $\sigma^2$ and smaller error d.f. than are used for the ordinary
step-down approach (an estimate that pretended the final reduced model was
prespecified).

When the full model was not fitted using least squares, least squares can
still easily be used to approximate the full model. If the coefficient estimates
from the full model are $\hat{\beta}$, estimates from the approximate model are
matrix contrasts of $\hat{\beta}$, namely, $W\hat{\beta}$, where $W = (T'T)^{-1}T'X$. So the
variance-covariance matrix of the reduced coefficient estimates is given by

$$WVW', \tag{5.2}$$

where V is the variance matrix for $\hat{\beta}$. See Section 19.5 for an example. Ambler
et  al.  1  studied  model  simplification  using  simulation  studies  based  on  several 
clinical  datasets,  and  compared  it  with  ordinary  backward  stepdown  variable 
selection  and  with  shrinkage  methods  such  as  the  lasso  (see  Section  4.3).  They 
found  that  ordinary  backwards  variable  selection  can  be  competitive  when 
there  is  a  large  fraction  of  truly  irrelevant  predictors  (something  that  can  be 
difficult  to  know  in  advance).  Paul  et  al.485  found  advantages  to  modeling 
the  response  with  a  complex  but  reliable  approach,  and  then  developing  a 
parsimonious model using the lasso or stepwise variable selection against Y.
See  Section  11.7  for  a  case  study  in  model  approximation. 
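
For a full model not fitted by least squares, the matrix-contrast form of the approximation can be sketched as follows. The simulated binary logistic model, the choice of which predictor to drop, and the object names are assumptions for illustration; the target being approximated is the linear predictor, not the predicted probability.

```r
# Hypothetical full model: binary logistic regression on three predictors
set.seed(9)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
yb <- rbinom(n, 1, plogis(0.5 + x1 + 0.5 * x2))
full <- glm(yb ~ x1 + x2 + x3, family = binomial)

X        <- model.matrix(full)                     # full design matrix
T_sub    <- X[, c("(Intercept)", "x1", "x2")]      # reduced design (x3 dropped)
beta_hat <- coef(full)
V        <- vcov(full)                             # variance matrix of beta-hat

W           <- solve(crossprod(T_sub), crossprod(T_sub, X))   # (T'T)^{-1} T'X
approx_coef <- drop(W %*% beta_hat)                # coefficients of the approximation
approx_var  <- W %*% V %*% t(W)                    # their variance-covariance matrix

# Same coefficients obtained by least squares against the full linear predictor
lp        <- drop(X %*% beta_hat)
ls_fit    <- lm(lp ~ x1 + x2)
approx_r2 <- summary(ls_fit)$r.squared             # how well the full model is approximated
```

`approx_r2` plays the role of the approximation accuracy that would drive a stopping rule such as "stop before $R^2$ drops below 0.95".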

Gelman213 argues that continuous variables should be scaled by two standard
deviations to make them comparable to binary predictors. However his approach
assumes linearity in the predictor effect and assumes the prevalence of the binary
predictor is near 0.5. John Fox [202, p. 95] points out that if two predictors are
on the same scale and have the same impact (e.g., years of employment and
years of education), standardizing the coefficients will make them appear to
have different impacts.

Levine et al.401 have a compelling argument for graphing effect ratios on a
logarithmic scale.

Hankins254 is a definitive reference on nomograms and has multi-axis examples
of historical significance. According to Hankins, Maurice d’Ocagne could be
called the inventor of the nomogram, starting with alignment diagrams in 1884
and declaring a new science of “nomography” in 1899. d’Ocagne was at École
des Ponts et Chaussées, a French civil engineering school. Julien and Hanley328
have a nice example of adding axes to a nomogram to estimate the absolute
effect of a treatment estimated using a Cox proportional hazards model. Kattan
and Marasco339 have several clinical examples and explain advantages to the
user of nomograms over “black box” computerized prediction.

Graham and Clavel231 discuss graphical and tabular ways of obtaining risk
estimates. van Gorp et al.630 have a nice example of a score chart for manually
obtaining estimates.

Larsen and Merlo375 developed a similar measure: the median odds ratio.
Gonen and Heller223 developed a c-index that like g is a function of the covariate
distribution.

Booth and Sarkar61 have a nice analysis of the number of bootstrap resamples
needed to guarantee with 0.95 confidence that a variance estimate has a
sufficiently small relative error. They concentrate on the Monte Carlo simulation
error, showing that small errors in variance estimates can lead to important
differences in P-values. Canty et al.91 provide a number of diagnostics to check
the reliability of bootstrap calculations.

There are many variations on the basic bootstrap for computing confidence
limits.150, 178 See Booth and Sarkar61 for useful information about choosing
the number of resamples. They report the number of resamples necessary to
not appreciably change P-values, for example. Booth and Sarkar propose a
more conservative number of resamples than others use (e.g., 800 resamples)
for estimating variances. Carpenter and Bithell92 have an excellent overview of
bootstrap confidence intervals, with practical guidance. They also have a good
discussion of the unconditional nonparametric bootstrap versus the conditional
semiparametric bootstrap.

Altman and Royston18 have a good general discussion of what it means to
validate a predictive model, including issues related to study design and
consideration of uses to which the model will be put.

An excellent paper on external validation and generalizability is Justice et al.329
Bleeker et al.58 provide an example where internal validation is misleading when
compared with a true external validation done using subjects from different
centers in a different time period. Vergouwe et al.638 give good guidance about
the number of events needed in the sample used for external validation of binary
logistic models.

See  Picard  and  Berk505  for  more  about  data-splitting. 

In  the  context  of  variable  selection  where  one  attempts  to  select  the  set  of  vari¬ 
ables  with  nonzero  true  regression  coefficients  in  an  ordinary  regression  model, 
Shao565  demonstrated  that  leave-out-one  cross-validation  selects  models  that 
are  “too  large.”  Shao  also  showed  that  the  number  of  observations  held  back  for 
validation  should  often  be  larger  than  the  number  used  to  train  the  model.  This 
is  because  in  this  case  one  is  not  interested  in  an  accurate  model  (you  fit  the 
whole  sample  to  do  that),  but  an  accurate  estimate  of  prediction  error  is  manda¬ 
tory  so  as  to  know  which  variables  to  allow  into  the  final  model.  Shao  suggests 
using  a  cross-validation  strategy  in  which  approximately  n3/4  observations  are 
used  in  each  training  sample  and  the  remaining  observations  are  used  in  the 
test  sample.  A  repeated  balanced  or  Monte  Carlo  splitting  approach  is  used, 
and  accuracy  estimates  are  averaged  over  2 n  (for  the  Monte  Carlo  method) 
repeated  splits. 
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12]  Picard  and  Cook’s  Monte  Carlo  cross-validation  procedure506  is  an  improve¬ 
ment  over  ordinary  cross-validation. 

13]  The  randomization  method  is  related  to  Kipnis’  “chaotization  relevancy  princi- 
ple”34s  [n  which  one  chooses  between  two  models  by  measuring  how  far  each  is 
from  a  nonsense  model.  Tibshirani  and  Knight  also  use  a  randomization  method 
for  estimating  the  optimism  in  a  model  fit.611 

14 The method used here is a slight change over that presented in [172], where
Efron wrote predictive accuracy as a sum of per-observation components (such
as 1 if the observation is classified correctly, 0 otherwise). Here we are writing
m × the unitless summary index of predictive accuracy in the place of Efron’s
sum of m per-observation accuracies [416, p. 613].

15 See [633] and [66, Section 4] for insight on the meaning of expected optimism.

16 See Copas,123 van Houwelingen and le Cessie [633, p. 1318], Verweij and van
Houwelingen,640 and others631 for other methods of estimating shrinkage
coefficients.

17 Efron172 developed the “.632” estimator only for the case where the index being
bootstrapped is estimated on a per-observation basis. A natural generalization
of this method can be derived by assuming that the accuracy evaluated on
observation i that is omitted from a bootstrap sample has the same expectation
as the accuracy of any other observation that would be omitted from the sample.
The modified estimate of $\hat{\epsilon}_0$ is then given by

$$\hat{\epsilon}_0 = \sum_{i=1}^{B} w_i T_i, \qquad (5.3)$$

where $T_i$ is the accuracy estimate derived from fitting a model on the ith
bootstrap sample and evaluating it on the observations omitted from that bootstrap
sample, and $w_i$ are weights derived for the B bootstrap samples:

$$w_i = \frac{1}{n} \sum_{j=1}^{n} \frac{[\text{bootstrap sample } i \text{ omits observation } j]}{\#\,\text{bootstrap samples omitting observation } j}. \qquad (5.4)$$

Note that $\hat{\epsilon}_0$ is undefined if any observation is included in every bootstrap
sample. Increasing B will avoid this problem. This modified “.632” estimator
is easy to compute if one assembles the bootstrap sample assignments and
computes the $w_i$ before computing the accuracy indexes $T_i$. For large n, the $w_i$
approach 1/B and so $\hat{\epsilon}_0$ becomes equivalent to the accuracy computed on the
observations not contained in the bootstrap sample and then averaged over the
B repetitions.
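As a rough illustration of Equations 5.3 and 5.4, the weights can be assembled in base R before any accuracy index is computed. The sample size, number of resamples, and the placeholder accuracies T_acc below are assumptions made for the sketch; in practice each $T_i$ would come from refitting the model on bootstrap sample i.

```r
set.seed(3)
n <- 40     # hypothetical number of observations
B <- 300    # hypothetical number of bootstrap samples

# idx[, i] holds the observation numbers drawn for bootstrap sample i
idx <- replicate(B, sample(n, n, replace = TRUE))

# omit[j, i] is TRUE when bootstrap sample i omits observation j
omit <- sapply(seq_len(B), function(i) !(seq_len(n) %in% idx[, i]))

# Number of bootstrap samples omitting each observation j;
# the estimator is undefined if any of these counts is zero
n_omit <- rowSums(omit)
stopifnot(all(n_omit > 0))

# Equation 5.4: average over observations of indicator / count
w <- colSums(omit / n_omit) / n

# Placeholder accuracies, one per bootstrap sample (made-up values)
T_acc <- runif(B, 0.6, 0.8)

# Equation 5.3: weighted sum of the per-sample accuracies
eps0 <- sum(w * T_acc)
c(sum_of_weights = sum(w), eps0 = eps0)
```

The weights sum to one by construction, so the estimate is a weighted average of the $T_i$.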

18  Efron  and  Tibshirani179  have  reduced  the  bias  of  the  “.632”  estimator  further 
with  only  a  modest  increase  in  its  variance.  Simulation  has,  however,  shown  no 
advantage  of  this  “.632+”  method  over  the  basic  optimism  bootstrap  for  most 
accuracy  indexes  used  in  logistic  models. 

19  van  Houwelingen  and  le  Cessie633  have  several  interesting  developments  in
model  validation.  See  Breiman66  for  a  discussion  of  the  choice  of  X  for  which 
to  validate  predictions.  Steyerberg  et  al.587  present  simulations  showing  the 
number  of  bootstrap  samples  needed  to  obtain  stable  estimates  of  optimism  of 
various  accuracy  measures.  They  demonstrate  that  bootstrap  estimates  of  op¬ 
timism  are  nearly  unbiased  when  compared  with  simulated  external  estimates. 
They  also  discuss  problems  with  precision  of  estimates  of  accuracy,  especially 
when  using  external  validation  on  small  samples. 

20 Blettner and Sauerbrei also demonstrate the variability caused by data-driven
analytic decisions.59 Chatfield100 has more results on the effects of using the
data to select the model.

5.7 Problem

Perform  a  simulation  study  to  understand  the  performance  of  various  internal 
validation  methods  for  binary  logistic  models.  Modify  the  R  code  below  in  at 
least  two  meaningful  ways  with  regard  to  covariate  distribution  or  number, 
sample  size,  true  regression  coefficients,  number  of  resamples,  or  number  of 
times  certain  strategies  are  averaged.  Interpret  your  findings  and  give  recom¬ 
mendations  for  best  practice  for  the  type  of  configuration  you  studied.  The 
R  code  from  this  assignment  may  be  downloaded  from  the  RMS  course  wiki 
page. 

For  each  of  200  simulations,  the  code  below  generates  a  training  sample 
of  200  observations  with  p  predictors  (p  =  15  or  30)  and  a  binary  response. 
The  predictors  are  independently  U(−0.5,  0.5).  The  response  is  sampled  so
as  to  follow  a  logistic  model  where  the  intercept  is  zero  and  all  regression 
coefficients  equal  0.5.  The  “gold  standard”  is  the  predictive  ability  of  the 
fitted  model  on  a  test  sample  containing  50,000  observations  generated  from 
the  same  population  model.  For  each  of  the  200  simulations,  several  validation 
methods  are  employed  to  estimate  how  the  training  sample  model  predicts 
responses  in  the  50,000  observations.  These  validation  methods  involve  fitting 
40  or  200  models  in  resamples. 

g-fold cross-validation is done using the command validate(f, method='cross',
B=g) using the rms package. This was repeated and averaged using
an extra loop, shown below.

For bootstrap methods, validate(f, method='boot' or '.632', B=40 or
B=200) was used. method='.632' does Efron’s “.632” method179, labeled 632a in
the output. An ad-hoc modification of the .632 method, 632b, was also done.
Here a “bias-corrected” index of accuracy is simply the index evaluated in the
observations omitted from the bootstrap resample. The “gold standard” external
validations were obtained from the val.prob function in the rms package.
The following indexes of predictive accuracy are used:

Dxy: Somers’ rank correlation between predicted probability that Y = 1 vs.
the binary Y values. This equals 2(C − 0.5) where C is the “ROC Area”
or concordance probability.

D: Discrimination index (likelihood ratio χ² divided by the sample size)

U: Unreliability index (unitless index of how far the logit calibration
curve intercept and slope are from (0, 1))

Q: Logarithmic accuracy score (a scaled version of the log-likelihood
achieved by the predictive model)

Intercept: Calibration intercept on logit scale

Slope: Calibration slope (slope of predicted log odds vs. true log odds)
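The relationship Dxy = 2(C − 0.5) can be checked by hand in base R. The sketch below computes C from the rank-sum statistic; the helper name c_index and the six predicted probabilities are made up for illustration and are not part of rms.

```r
# Concordance probability C (ROC area) from the rank-sum statistic.
# c_index is a hypothetical helper written for this sketch.
c_index <- function(p, y) {
  r  <- rank(p)                  # midranks handle tied predictions
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

p <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.90)   # predicted probabilities
y <- c(0,    0,    1,    1,    0,    1)      # observed binary responses

C   <- c_index(p, y)      # 8 of the 9 (event, non-event) pairs are concordant
Dxy <- 2 * (C - 0.5)
c(C = C, Dxy = Dxy)
```

With these values C is 8/9 and Dxy is 7/9, since only the pair (0.40, 0.65) is discordant.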

Accuracy  of  the  various  resampling  procedures  may  be  estimated  by  com¬ 
puting  the  mean  absolute  errors  and  the  root  mean  squared  errors  of  esti¬ 
mates  (e.g.,  of  Dxy  from  the  bootstrap  on  the  200  observations)  against  the 
“gold  standard”  (e.g.,  Dxy  for  the  fitted  200-observation  model  achieved  in 
the  50,000  observations).

    require(rms)
    set.seed(1)      # so can reproduce results
    n    <- 200      # Size of training sample
    reps <- 200      # Simulations
    npop <- 50000    # Size of validation gold standard sample
    methods <- c('Boot 40', 'Boot 200', '632a 40', '632a 200',
                 '632b 40', '632b 200', '10-fold x 4', '4-fold x 10',
                 '10-fold x 20', '4-fold x 50')
    R <- expand.grid(sim    = 1:reps,
                     p      = c(15, 30),
                     method = methods)
    R$Dxy <- R$Intercept <- R$Slope <- R$D <- R$U <- R$Q <-
      R$repmeth <- R$B <- NA
    R$n <- n

    ## Function to do r overall reps of B resamples, averaging to
    ## get estimates similar to as if r*B resamples were done
    val <- function(fit, method, B, r) {
      contains <- function(m) length(grep(m, method)) > 0
      meth <- if(contains('Boot')) 'boot' else
              if(contains('fold')) 'crossvalidation' else
              if(contains('632'))  '.632'
      z <- 0
      for(i in 1:r) z <- z + validate(fit, method=meth, B=B)[
              c("Dxy", "Intercept", "Slope", "D", "U", "Q"),
              'index.corrected']
      z/r
    }

    for(p in c(15, 30)) {
      ## For each p create the true betas, the design matrix,
      ## and realizations of binary y in the gold standard
      ## large sample
      Beta <- rep(.5, p)    # True betas
      X    <- matrix(runif(npop*p), nrow=npop) - 0.5
      LX   <- matxv(X, Beta)
      Y    <- ifelse(runif(npop) < plogis(LX), 1, 0)

      ## For each simulation create the data matrix and
      ## realizations of y
      for(j in 1:reps) {
        ## Make training sample
        x <- matrix(runif(n*p), nrow=n) - 0.5
        L <- matxv(x, Beta)
        y <- ifelse(runif(n) < plogis(L), 1, 0)
        f <- lrm(y ~ x, x=TRUE, y=TRUE)
        beta     <- f$coef
        forecast <- matxv(X, beta)

        ## Validate in population
        v <- val.prob(logit=forecast, y=Y, pl=FALSE)[
               c("Dxy", "Intercept", "Slope", "D", "U", "Q")]

        for(method in methods) {
          repmeth <- 1
          if(method %in% c('Boot 40', '632a 40', '632b 40'))
            B <- 40
          if(method %in% c('Boot 200', '632a 200', '632b 200'))
            B <- 200
          if(method == '10-fold x 4') {
            B <- 10
            repmeth <- 4
          }
          if(method == '4-fold x 10') {
            B <- 4
            repmeth <- 10
          }
          if(method == '10-fold x 20') {
            B <- 10
            repmeth <- 20
          }
          if(method == '4-fold x 50') {
            B <- 4
            repmeth <- 50
          }
          z <- val(f, method, B, repmeth)
          k <- which(R$sim == j & R$p == p & R$method == method)
          if(length(k) != 1) stop('program logic error')
          R[k, names(z)] <- z - v
          R[k, c('B', 'repmeth')] <- c(B=B, repmeth=repmeth)
        }  # end over methods
      }  # end over reps
    }  # end over p

Results  are  best  summarized  in  a  multi-way  dot  chart.  Bootstrap  nonpara- 
metric  percentile  0.95  confidence  limits  are  included. 

    statnames <- names(R)[6:11]
    w <- reshape(R, direction='long', varying=list(statnames),
                 v.names='x', timevar='stat', times=statnames)
    w$p <- paste('p', w$p, sep='=')
    require(lattice)
    s <- with(w, summarize(abs(x), llist(p, method, stat),
                           smean.cl.boot, stat.name='mae'))
    Dotplot(method ~ Cbind(mae, Lower, Upper) | stat*p, data=s,
            xlab='Mean |error|')
    s <- with(w, summarize(x^2, llist(p, method, stat),
                           smean.cl.boot, stat.name='mse'))
    Dotplot(method ~ Cbind(sqrt(mse), sqrt(Lower), sqrt(Upper)) |
            stat*p, data=s,
            xlab=expression(sqrt(MSE)))
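For readers who want to see what a nonparametric percentile bootstrap interval for a mean involves, here is a minimal base-R sketch. It is similar in spirit to what smean.cl.boot reports but is not that function's implementation; the helper name boot_ci_mean and the simulated error vector are assumptions made for illustration.

```r
set.seed(4)
# Hypothetical errors: resampling estimate minus gold standard
err <- rnorm(200, mean = 0, sd = 0.05)

# Percentile bootstrap confidence limits for the mean of x
# (boot_ci_mean is a hypothetical helper written for this sketch)
boot_ci_mean <- function(x, B = 1000, conf = 0.95) {
  means <- replicate(B, mean(sample(x, replace = TRUE)))
  alpha <- (1 - conf) / 2
  c(Mean  = mean(x),
    Lower = unname(quantile(means, alpha)),
    Upper = unname(quantile(means, 1 - alpha)))
}

mae  <- boot_ci_mean(abs(err))       # mean absolute error with limits
rmse <- sqrt(boot_ci_mean(err^2))    # limits for MSE, shown on the RMSE scale
rbind(mae, rmse)
```

Taking square roots of the MSE limits mirrors the sqrt(Lower) and sqrt(Upper) calls in the plotting code above.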


Chapter  6 
R  Software 


The  methods  described  in  this  book  are  useful  in  any  regression  model  that 
involves  a  linear  combination  of  regression  parameters.  The  software  that  is 
described  below  is  useful  in  the  same  situations.  Functions  in  R  20  allow  inter¬ 
action  spline  functions  as  well  as  a  wide  variety  of  predictor  parameterizations 
for  any  regression  function,  and  facilitate  model  validation  by  resampling. 

R  is  the  most  comprehensive  tool  for  general  regression  models  for  the 
following  reasons. 

1.  It  is  very  easy  to  write  R  functions  for  new  models,  so  R  has  implemented 
a  wide  variety  of  modern  regression  models. 

2.  Designs  can  be  generated  for  any  model.  There  is  no  need  to  find  out 
whether  the  particular  modeling  function  handles  what  SAS  calls  “class” 
variables — dummy  variables  are  generated  automatically  when  an  R  cate¬ 
gory,  factor,  ordered,  or  character  variable  is  analyzed. 

3.  A  single  R  object  can  contain  all  information  needed  to  test  hypotheses 
and  to  obtain  predicted  values  for  new  data. 

4.  R  has  superior  graphics. 

5.  Classes  in  R  make  possible  the  use  of  generic  function  names  (e.g.,  predict, 
summary,  anova)  to  examine  fits  from  a  large  set  of  specific  model-fitting 
functions. 

R [601, 635] is a high-level object-oriented language for statistical analysis
with over six thousand packages and tens of thousands of functions
available. The R system [18, 520] is the basis for R software used in this
text, centered around the Regression Modeling Strategies (rms) package [261].
See the Appendix and the Web site for more information about software
implementations.

6.1 The R Modeling Language

R has a battery of functions that make up a statistical modeling language.96
At the heart of the modeling functions is an R formula of the form

    response ~ terms

The terms represent additive components of a general linear model. Although
variables and functions of variables make up the terms, the formula refers
to additive combinations; for example, when terms is age + blood.pressure,
it refers to $\beta_1 \times$ age $+ \beta_2 \times$ blood.pressure. Some examples of formulas are
below.

    y ~ age + sex              # age + sex main effects
    y ~ age + sex + age:sex    # add second-order interaction
    y ~ age*sex                # second-order interaction +
                               # all main effects
    y ~ (age + sex + pressure)^2
                               # age+sex+pressure+age:sex+age:pressure...
    y ~ (age + sex + pressure)^2 - sex:pressure
                               # all main effects and all 2nd order
                               # interactions except sex:pressure
    y ~ (age + race)*sex       # age+race+sex+age:sex+race:sex
    y ~ treatment*(age*race + age*sex)
                               # no interact. with race,sex
    sqrt(y) ~ sex*sqrt(age) + race
                               # functions, with dummy variables generated if
                               # race is an R factor (classification) variable
    y ~ sex + poly(age,2)      # poly makes orthogonal polynomials
    race.sex <- interaction(race,sex)
    y ~ age + race.sex         # if desire dummy variables for all
                               # combinations of the factors

The formula for a regression model is given to a modeling function; for
example,

    lrm(y ~ rcs(x,4))

is read “use a logistic regression model to model y as a function of x,
representing x by a restricted cubic spline with four default knots.”a You can use
the R function update to refit a model with changes to the model terms or the
data used to fit it:

    f  <- lrm(y ~ rcs(x,4) + x2 + x3)
    f2 <- update(f, subset=sex=="male")
    f3 <- update(f, .~.-x2)             # remove x2 from model
    f4 <- update(f, .~. + rcs(x5,5))    # add rcs(x5,5) to model
    f5 <- update(f, y2 ~ .)             # same terms, new response var.

a lrm and rcs are in the rms package.
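The expansion rules for the formula operators in this section can be inspected directly with base R, without fitting any model. The helper name expand_terms is hypothetical; terms and update are standard functions in the stats package, and update is applied here to bare formulas rather than to rms fit objects.

```r
# List the individual terms that a formula expands to
# (expand_terms is a hypothetical helper written for this sketch)
expand_terms <- function(f) attr(terms(f), "term.labels")

expand_terms(y ~ age * sex)
expand_terms(y ~ (age + sex + pressure)^2 - sex:pressure)
expand_terms(y ~ (age + race) * sex)

# update() also works on bare formulas, using "." for "what was there before"
f0 <- y ~ age + sex + pressure
f1 <- update(f0, . ~ . - sex)    # drop sex from the right-hand side
f2 <- update(f0, y2 ~ .)         # same terms, new response variable
f1
f2
```

Checking the expansion this way before fitting is a cheap guard against accidentally including or excluding an interaction.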

6.2 User-Contributed Functions

In addition to the many functions that are packaged with R, a wide variety
of user-contributed functions is available on the Internet (see the Appendix
or Web site for addresses). Two packages of functions used extensively in
this text are Hmisc and rms, written by the author. The Hmisc package contains
miscellaneous functions such as varclus, spearman2, transcan, hoeffd,
rcspline.eval, impute, cut2, describe, sas.get, latex, and several power and
sample size calculation functions. The varclus function uses the R hclust
hierarchical clustering function to do variable clustering, and the R plclust
function to draw dendrograms depicting the clusters. varclus offers a choice
of three similarity measures (Pearson r², Spearman ρ², and Hoeffding D)
and uses pairwise deletion of missing values. varclus automatically generates
a series of dummy variables for categorical factors. The Hmisc hoeffd function
computes a matrix of Hoeffding Ds for a series of variables. The spearman2
function will do Wilcoxon, Spearman, and Kruskal-Wallis tests and generalizes
Spearman’s ρ to detect non-monotonic relationships.
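The idea behind variable clustering with a squared Spearman similarity can be sketched with base R alone. This is not the varclus implementation; the simulated variables and the choice of two clusters are assumptions made for the illustration.

```r
set.seed(5)
n <- 100
z <- rnorm(n)
d <- data.frame(a1 = z + rnorm(n, sd = 0.3),   # a1 and a2 share a common source
                a2 = z + rnorm(n, sd = 0.3),
                b1 = rnorm(n))
d$b2 <- d$b1^3 + rnorm(n, sd = 0.3)            # monotone but nonlinear in b1

# Similarity: squared Spearman rank correlation with pairwise deletion of NAs
sim <- cor(d, method = "spearman", use = "pairwise.complete.obs")^2

# Hierarchical clustering on the dissimilarity 1 - similarity
hc  <- hclust(as.dist(1 - sim), method = "complete")
grp <- cutree(hc, k = 2)
grp
# plot(hc) would draw the dendrogram
```

Because the similarity is rank-based, b1 and b2 cluster together even though their relationship is far from linear.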

Hmisc’s transcan function (see Section 4.7) performs a similar function to
PROC PRINQUAL in SAS: it uses restricted splines, dummy variables, and canonical
variates to transform each of a series of variables while imputing missing
values. An option to shrink regression coefficients for the imputation models
avoids overfitting for small samples or a large number of predictors. transcan
can also do multiple imputation and adjust variance-covariance matrices for
imputation. See Chapter 8 for an example of using these functions for data
reduction.

See  the  Web  site  for  a  list  of  R  functions  for  correspondence  analysis, 
principal  component  analysis,  and  missing  data  imputation  available  from 
other  users.  Venables  and  Ripley  [635,  Chapter  11]  provide  a  nice  description 
of  the  multivariate  methods  that  are  available  in  R,  and  they  provide  several 
new  multivariate  analysis  functions. 

A basic function in Hmisc is the rcspline.eval function, which creates a
design matrix for a restricted (natural) cubic spline using the truncated power
basis. Knot locations are optionally estimated using methods described in
Section 2.4.6, and two types of normalizations to reduce numerical problems
are supported. You can optionally obtain the design matrix for the antiderivative
of the spline function. The rcspline.restate function computes
the coefficients (after un-normalizing if needed) that translate the restricted
cubic spline function to unrestricted form (Equation 2.27). rcspline.restate
also outputs LaTeX and R representations of spline functions in simplified
form.




6.3  The  rms  Package 

A package of R functions called rms contains several functions that extend
R to make the analyses described in this book easy to do. A central function
in rms is datadist, which computes statistical summaries of predictors to
automate estimation and plotting of effects. datadist exists as a separate function
so that the candidate predictors may be summarized once, thus saving
time when fitting several models using subsets or different transformations of
predictors. If datadist is called before model fitting, the distributional summaries
are stored with the fit so that the fit is self-contained with respect
to later estimation. Alternatively, datadist may be called after the fit to create
temporary summaries to use as plot ranges and effect intervals, or these
ranges may be specified explicitly to Predict and summary (see below), without
ever calling datadist. The input to datadist may be a data frame, a list of
individual predictors, or a combination of the two.

The characteristics saved by datadist include the overall range and certain
quantiles for continuous variables, and the distinct values for discrete variables
(i.e., R factor variables or variables with 10 or fewer unique values). The
quantiles and set of distinct values facilitate estimation and plotting, as described
later. When a function of a predictor is used (e.g., pol(pmin(x,50),2)),
the limits saved apply to the innermost variable (here, x). When a plot is requested
for how x relates to the response, the plot will have x on the x-axis,
not pmin(x,50). The way that defaults are computed can be controlled by
the q.effect and q.display parameters to datadist. By default, continuous
variables are plotted with ranges determined by the tenth smallest and tenth
largest values occurring in the data (if n < 200, the 0.05 and 0.95 quantiles
are used). The default range for estimating effects such as odds and hazard
ratios is the lower and upper quartiles. When a predictor is adjusted to a
constant so that the effects of changes in other predictors can be studied, the
default constant used is the median for continuous predictors and the most
frequent category for factor variables. The R system option datadist is used
to point to the result returned by the datadist function. See the help files for
datadist for more information.
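The default rules described above can be mimicked in a few lines of base R. This sketch only restates those rules; it is not the datadist implementation, and the helper name describe_predictor and its return structure are hypothetical.

```r
# Summaries of one predictor following the defaults described in the text
# (describe_predictor is a hypothetical helper, not part of rms)
describe_predictor <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (is.factor(x) || length(unique(x)) <= 10) {
    # Discrete: keep the distinct values, adjust to the most frequent one
    tab <- table(x)
    return(list(values    = names(tab),
                adjust_to = names(tab)[which.max(tab)]))
  }
  s <- sort(x)
  plot_range <- if (n < 200) {
    unname(quantile(x, c(0.05, 0.95)))
  } else {
    c(s[10], s[n - 9])               # tenth smallest and tenth largest values
  }
  list(plot_range   = plot_range,
       effect_range = unname(quantile(x, c(0.25, 0.75))),  # quartiles
       adjust_to    = median(x))
}

age <- 1:1000
sex <- factor(c("f", "m", "m"))
d_age <- describe_predictor(age)
d_sex <- describe_predictor(sex)
d_age
d_sex
```

In actual use one would call datadist on the data frame and set options(datadist=...) as described above rather than computing these values by hand.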

rms fitting functions save detailed information for later prediction, plotting,
and testing. rms also allows for special restricted interactions and sets the
default method of generating contrasts for categorical variables to
"contr.treatment", the traditional dummy-variable approach.

rms has a special operator %ia% in the terms of a formula that allows for
restricted interactions. For example, one may specify a model that contains
sex and a five-knot linear spline for age, but restrict the age × sex interaction
to be linear in age. To be able to connect this incomplete interaction with the
main effects for later hypothesis testing and estimation, the following formula
would be given:

    y ~ sex + lsp(age, c(20,30,40,50,60)) +
        sex %ia% lsp(age, c(20,30,40,50,60))

Table 6.1 rms Fitting Functions

Function  Purpose                                          Related R Functions
--------  -----------------------------------------------  -------------------
ols       Ordinary least squares linear model              lm
lrm       Binary and ordinal logistic regression model;    glm
          has options for penalized MLE
orm       Ordinal semi-parametric regression model with    polr, lrm
          several link functions
psm       Accelerated failure time parametric survival     survreg
          models
cph       Cox proportional hazards regression              coxph
bj        Buckley-James censored least squares model       survreg, lm
Glm       General linear models                            glm
Gls       Generalized least squares                        gls
Rq        Quantile regression                              rq

The following expression would restrict the age × cholesterol interaction to
be of the form AF(B) + BG(A) by removing doubly nonlinear terms.

    y ~ lsp(age,30) + rcs(cholesterol,4) +
        lsp(age,30) %ia% rcs(cholesterol,4)

rms  has  special  fitting  functions  that  facilitate  many  of  the  procedures  de¬ 
scribed  in  this  book,  shown  in  Table  6.1. 

Glm is a slight modification of the built-in R glm function so that rms methods
can be run on the resulting fit object. glm fits general linear models under
a wide variety of distributions of Y. Gls is a modification of the gls function
from the nlme package of Pinheiro and Bates509, for repeated measures
(longitudinal) and spatially correlated data. The Rq function is a modification of the
quantreg package’s rq function356,357. Functions related to survival analysis
make heavy use of Therneau’s survival package482.

You may want to specify to the fitting functions an option for how missing
values (NAs) are handled. The method for handling missing data in R is to
specify an na.action function. Some possible na.actions are given in Table 6.2.
The default na.action is na.delete when you use rms’s fitting functions. An
easy way to specify a new default na.action is, for example,

    options(na.action="na.omit")   # don't report frequency of NAs

before using a fitting function. If you use na.delete you can also use the system
option na.detail.response that makes model fits store information about Y
stratified by whether each X is missing. The default descriptive statistics for
Y are the sample size and mean. For a survival time response object the
sample size and proportion of events are used. Other summary functions can
be specified using the na.fun.response option.

Table 6.2 Some na.actions Used in rms

Function Name  Method Used
-------------  -----------------------------------------------------
na.fail        Stop with error message if any missing values present
na.omit        Function to remove observations with any predictors
               or responses missing
na.delete      Modified version of na.omit to also report on
               frequency of NAs for each variable


    options(na.action="na.delete", na.detail.response=TRUE,
            na.fun.response="mystats")
    # Just use na.fun.response="quantile" if don't care about n
    mystats <- function(y) {
      z <- quantile(y, na.rm=TRUE)
      n <- sum(!is.na(y))
      c(N=n, z)    # elements named N, 0%, 25%, etc.
    }

When R deletes missing values during the model-fitting procedure, residuals,
fitted values, and other quantities stored with the fit will not correspond
row-for-row with observations in the original data frame (which retained NAs). This
is problematic when, for example, age in the dataset is plotted against the
residual from the fitted model. Fortunately, for many na.actions including
na.delete and a modified version of na.omit, a class of R functions called
naresid written by Therneau works behind the scenes to put NAs back into
residuals, predicted values, and other quantities when the predict or residuals
functions (see below) are used. Thus for some of the na.actions, predicted
values and residuals will automatically be arranged to match the original
data.
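The row-alignment issue can be seen with base R alone. In the sketch below, base R's na.exclude stands in for the padding behavior described above (rms's na.delete is not used here), and the small data frame is made up for illustration.

```r
d <- data.frame(age = c(30, 45, NA, 60, 52),
                y   = c(1.2, 2.1, 1.9, NA, 2.8))

# na.omit drops incomplete rows; residuals no longer line up with d
fit_omit <- lm(y ~ age, data = d, na.action = na.omit)
r_omit   <- residuals(fit_omit)

# na.exclude also drops them for fitting, but residuals() pads with NA
# (via naresid) so the result has one element per row of d
fit_excl <- lm(y ~ age, data = d, na.action = na.exclude)
r_excl   <- residuals(fit_excl)

length(r_omit)    # number of complete rows
length(r_excl)    # number of rows in d
r_excl
```

With the padded residuals, a call such as plot(d$age, r_excl) pairs each residual with the correct row.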

Any R function can be used in the terms for formulas given to the fitting
function, but if the function represents a transformation that has data-dependent
parameters (such as the standard R functions poly or ns), R will
not in general be able to compute predicted values correctly for new observations.
For example, the function ns that automatically selects knots for a
B-spline fit will not be conducive to obtaining predicted values if the knots
are kept “secret.” For this reason, a set of functions that keep track of
transformation parameters exists in rms for use with the functions highlighted
in this book. These are shown in Table 6.3. Of these functions, asis, catg,
scored, and matrx are almost always called implicitly and are not mentioned
by the user. catg is usually called explicitly when the variable is a numeric
variable to be used as a polytomous factor, and it has not been converted to
an R categorical variable using the factor function.

Table 6.3 rms Transformation Functions

Function  Purpose                                          Related R Functions
--------  -----------------------------------------------  -------------------
asis      No post-transformation (seldom used explicitly)  I
rcs       Restricted cubic spline                          ns
pol       Polynomial using standard notation               poly
lsp       Linear spline
catg      Categorical predictor (seldom)                   factor
scored    Ordinal categorical variables                    ordered
matrx     Keep variables as group for anova and fastbw     matrix
strat     Nonmodeled stratification factors                strata
          (used for cph only)

These functions can be used with any function of a predictor. For example,
to obtain a four-knot cubic spline expansion of the cube root of x, specify
rcs(x^(1/3),4).

When  the  transformation  functions  are  called,  they  are  usually  given  one 
or  two  arguments,  such  as  rcs(x,5).  The  first  argument  is  the  predictor  vari¬ 
able  or  some  function  of  it.  The  second  argument  is  an  optional  vector  of 
parameters  describing  a  transformation,  for  example  location  or  number  of 
knots.  Other  arguments  may  be  provided. 

The  Hmisc  package’s  cut2  function  is  sometimes  used  to  create  a  categorical 
variable  from  a  continuous  variable  x.  You  can  specify  the  actual  interval 
endpoints  (cuts),  the  number  of  observations  to  have  in  each  interval  on 
the  average  (m),  or  the  number  of  quantile  groups  (g).  Use,  for  example, 
cuts=c(0,1,2)  to  cut  into  the  intervals  [0, 1),  [1,2].
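The same interval convention can be reproduced with base R's cut function, shown here as a stand-in because cut2 requires the Hmisc package. The example values of x are made up.

```r
x <- c(0, 0.4, 1, 1.7, 2)

# right = FALSE makes intervals closed on the left; include.lowest = TRUE
# then closes the last interval on the right so that x = 2 is retained
g <- cut(x, breaks = c(0, 1, 2), right = FALSE, include.lowest = TRUE)

levels(g)
table(g)
```

Without include.lowest = TRUE the value 2 would fall outside every interval and become NA.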

A key concept in fitting models in R is that the fitting function returns an
object that is an R list. This object contains basic information about the fit
(e.g., regression coefficient estimates and covariance matrix, model χ²) as well
as information about how each parameter of the model relates to each factor
in the model. Components of the fit object are addressed by, for example,
fit$coef, fit$var, fit$loglik. rms causes the following information to also
be retained in the fit object: the limits for plotting and estimating effects
for each factor (if options(datadist="name") was in effect), the label for each
factor, and a vector of values indicating which parameters associated with a
factor are nonlinear (if any). Thus the “fit object” contains all the information
needed to get predicted values, plots, odds or hazard ratios, and hypothesis
tests, and to do “smart” variable selection that keeps parameters together
when they are all associated with the same predictor.
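The list structure of a fit object can be explored with any R fitting function. The sketch below uses base R's lm and the built-in cars data as a stand-in; an rms fit would carry additional components (such as var and loglik) that are not shown here.

```r
fit <- lm(dist ~ speed, data = cars)

class(fit)        # the class used for method dispatch
is.list(fit)      # the fit object is a list underneath
names(fit)        # components stored with the fit

fit$coef          # partial matching picks up fit$coefficients
coef(fit)         # the extractor function returns the same values
vcov(fit)         # covariance matrix of the coefficient estimates
```

Extractor functions such as coef and vcov are usually preferable to reaching into components by name, since component names differ between fitting functions.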

R uses the notion of the class of an object. The object-oriented class idea
allows one to write a few generic functions that decide which specific functions
to call based on the class of the object passed to the generic function.
An example is the function for printing the main results of a logistic model.
The lrm function returns a fit object of class "lrm". If you specify the R command
print(fit) (or just fit if using R interactively, which invokes print), the
print function invokes the print.lrm function to do the actual printing specific
to logistic models. To find out which particular methods are implemented for
a given generic function, type methods(generic.name).
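Dispatch of a generic function on the class of its argument can be demonstrated with a toy class. The class name toyfit and its constructor are invented for this sketch and have no connection to rms.

```r
# Constructor for a toy fit object of class "toyfit" (hypothetical class)
toyfit <- function(coef) {
  structure(list(coefficients = coef), class = "toyfit")
}

# Method that the print generic calls for objects of class "toyfit"
print.toyfit <- function(x, ...) {
  cat("toyfit with", length(x$coefficients), "coefficients\n")
  invisible(x)
}

f <- toyfit(c(a = 1, b = 2))
print(f)                           # dispatches to print.toyfit
out <- capture.output(print(f))    # the printed line, captured as text

# methods(print) lists every print method currently available
```

Typing f at the interactive prompt would produce the same line, since autoprinting calls print.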

Generic  functions  that  are  used  in  this  book  include  those  in  Table  6.4. 


Table 6.4 rms Package and R Generic Functions

Function         Purpose                                          Related Functions
---------------  -----------------------------------------------  -----------------
print            Print parameters and statistics of fit
coef             Fitted regression coefficients
formula          Formula used in the fit
specs            Detailed specifications of fit
vcov             Fetch covariance matrix
logLik           Fetch maximized log-likelihood
AIC              Fetch AIC
lrtest           Likelihood ratio test for two nested models
univarLR         Compute all univariable LR χ²
robcov           Robust covariance matrix estimates
bootcov          Bootstrap covariance matrix estimates
                 and bootstrap distributions of estimates
pentrace         Find optimum penalty factors by tracing
                 effective AIC for a grid of penalties
effective.df     Print effective d.f. for each type of variable
                 in model, for penalized fit or pentrace result
summary          Summary of effects of predictors
plot.summary     Plot continuously shaded confidence bars
                 for results of summary
anova            Wald tests of most meaningful hypotheses
plot.anova       Graphical depiction of anova
contrast         General contrasts, C.L., tests
Predict          Predicted values and confidence limits easily
                 varying a subset of predictors and leaving the
                 rest set at default values
plot.Predict     Plot the result of Predict using lattice
ggplot           Plot the result of Predict using ggplot2
bplot            3-dimensional plot when Predict varied
                 two continuous predictors over a fine grid
gendata          Easily generate predictor combinations
predict          Obtain predicted values or design matrix
fastbw           Fast backward step-down variable selection       step
residuals        (or resid) Residuals, influence stats from fit
sensuc           Sensitivity analysis for unmeasured
                 confounder
which.influence  Which observations are overly influential        residuals
latex            LaTeX representation of fitted model             Function

continued on next page


Function

Function 

Hazard 

Survival 

ExProb 

Quantile 


Mean 

nomogram 

survest 

survplot 

validate 

calibrate 

vif 

naresid 

naprint 

impute 


continued  from  previous  page 


Purpose 

- - - 

R  function  analytic  representation  of  X(3 

from  a  fitted  regression  model 
R  function  analytic  representation  of  a  fitted 
hazard  function  (for  psm) 

R  function  analytic  representation  of  fitted 
survival  function  (for  psm,  cph) 

R  function  analytic  representation  of 
exceedance  probabilities  for  orm 
R  function  analytic  representation  of  fitted 
function  for  quantiles  of  survival  time 
(for  psm,  cph) 

R  function  analytic  representation  of  fitted 
function  for  mean  survival  time  or  for  ordinal  logistic 
Draws  a  nomogram  for  the  fitted  model 
Estimate  survival  probabilities  (psm,  cph) 

Plot  survival  curves  (psm,  cph) 

Validate  indexes  of  model  fit  using  resampling 
Estimate  calibration  curve  using  resampling 
Variance  inflation  factors  for  fitted  model 
Bring  elements  corresponding  to  missing  data 
back  into  predictions  and  residuals 
Print  summary  of  missing  values 
Impute  missing  values 


Related  Functions 
latex 


latex,  plot 
survf it 
plot . survf it 

val . prob 


transcan 


The first argument of the majority of functions is the object returned from
the model fitting function. When used with ols, lrm, orm, psm, cph, Glm, Gls, Rq,
bj, these functions do the following. specs prints the design specifications, for
example, number of parameters for each factor, levels of categorical factors,
knot locations in splines, and so on. vcov returns the variance-covariance
matrix for the model. logLik retrieves the maximized log-likelihood, whereas
AIC computes the Akaike Information Criterion for the model on the minus
twice log-likelihood scale (with an option to compute it on the $\chi^2$ scale if
you specify type='chisq'). lrtest, when given two fit objects from nested models,
computes the likelihood ratio test for the extra variables. univarLR computes
all univariable likelihood ratio $\chi^2$ statistics, one predictor at a time.
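The quantities behind logLik, AIC, and a likelihood ratio test can be checked by
hand. The sketch below uses base R's glm on simulated data rather than rms, so
that it runs without extra packages; the variable names and the simulated model
are assumptions made for illustration, and the hand computation is not
presented as the internals of lrtest.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + x1))

small <- glm(y ~ x1,      family = binomial)   # nested model
full  <- glm(y ~ x1 + x2, family = binomial)   # adds x2

ll.small <- as.numeric(logLik(small))
ll.full  <- as.numeric(logLik(full))

# AIC on the minus twice log-likelihood scale: -2 logL + 2p
aic.full <- -2 * ll.full + 2 * length(coef(full))
all.equal(aic.full, AIC(full))

# Likelihood ratio chi-square for the extra variable x2
lr   <- 2 * (ll.full - ll.small)
df   <- length(coef(full)) - length(coef(small))
pval <- pchisq(lr, df, lower.tail = FALSE)
c(LR = lr, d.f. = df, P = pval)

vcov(full)   # variance-covariance matrix of the estimates
```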

The robcov function computes the Huber robust covariance matrix estimate.
bootcov uses the bootstrap to estimate the covariance matrix of parameter
estimates. Both robcov and bootcov assume that the design matrix
and response variable were stored with the fit. They have options to adjust
for cluster sampling. Both replace the original variance-covariance matrix
with robust estimates and return a new fit object that can be passed to any
of the other functions. In that way, robust Wald tests, variable selection,
confidence limits, and many other quantities may be computed automatically.
The functions do save the old covariance estimates in component orig.var
of the new fit object. bootcov also optionally returns the matrix of parameter
estimates over the bootstrap simulations. These estimates can be used
to derive bootstrap confidence intervals that don't assume normality or
symmetry. Associated with bootcov are plotting functions for drawing histogram
and smooth density estimates for bootstrap distributions. bootcov also has
a feature for deriving approximate nonparametric simultaneous confidence
sets. For example, the function can get a simultaneous 0.90 confidence region
for the regression effect of age over its entire range.
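The idea of a bootstrap covariance matrix can be sketched in a few lines of base
R. This is an illustration of the general technique under assumed simulated
data, not the bootcov implementation: subjects are resampled with replacement,
the model is refitted, and the covariance matrix of the refitted coefficients
is taken.

```r
set.seed(2)
n <- 150
d <- data.frame(age = rnorm(n, 50, 10))
d$y <- rbinom(n, 1, plogis(-5 + 0.1 * d$age))

fit <- glm(y ~ age, family = binomial, data = d)

B <- 200                                   # number of bootstrap repetitions
boot.coef <- t(replicate(B, {
  i <- sample(n, replace = TRUE)           # resample rows with replacement
  coef(glm(y ~ age, family = binomial, data = d[i, ]))
}))

boot.var  <- var(boot.coef)                # bootstrap covariance matrix
model.var <- vcov(fit)                     # model-based covariance matrix

# Percentile interval for the age coefficient; no normality or symmetry assumed
quantile(boot.coef[, "age"], c(0.025, 0.975))
```

Keeping the matrix boot.coef is what makes the percentile interval possible, which
mirrors the optional matrix of bootstrap estimates described above.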

The pentrace function assists in selection of penalty factors for fitting
regression models using penalized maximum likelihood estimation (see
Section 9.10). Different types of model terms can be penalized by different
amounts. For example, one can penalize interaction terms more than main
effects. The effective.df function prints details about the effective degrees
of freedom devoted to each type of model term in a penalized fit.

summary prints a summary of the effects of each factor. When summary is
used to estimate effects (e.g., odds or hazard ratios) for continuous variables,
it allows the levels of interacting factors to be easily set, as well as allowing
the user to choose the interval for the effect. This method of estimating effects
allows for nonlinearity in the predictor. By default, interquartile range effects
(differences in $X\beta$, odds ratios, hazard ratios, etc.) are printed for
continuous factors, and all comparisons with the reference level are made for
categorical factors. See the example at the end of the summary documentation for
a method of quickly computing pairwise treatment effects and confidence intervals
for a large series of values of factors that interact with the treatment variable.
Saying plot(summary(fit)) will depict the effects graphically, with bars for a
list of confidence levels.

The anova function automatically tests most meaningful hypotheses in a
design. For example, suppose that age and cholesterol are predictors, and
that a general interaction is modeled using a restricted spline surface. anova
prints Wald statistics for testing linearity of age, linearity of cholesterol, age
effect (age + age × cholesterol interaction), cholesterol effect (cholesterol +
age × cholesterol interaction), linearity of the age × cholesterol interaction
(i.e., adequacy of the simple age × cholesterol 1 d.f. product), linearity of the
interaction in age alone, and linearity of the interaction in cholesterol alone.
Joint tests of all interaction terms in the model and all nonlinear terms in the
model are also performed. The plot.anova function draws a dot chart showing
the relative contribution ($\chi^2$, $\chi^2$ minus d.f., AIC, partial $R^2$,
P-value, etc.) of each factor in the model.

The contrast function is used to obtain general contrasts and corresponding
confidence limits and test statistics. This is most useful for testing effects
in the presence of interactions (e.g., type II and type III contrasts). See the
help file for contrast for several examples of how to obtain joint tests of
multiple contrasts (see Section 9.3.2) as well as double differences (interaction
contrasts).

The predict function is used to obtain a variety of values or predicted
values from either the data used to fit the model or a new dataset. The
Predict function is easier to use for most purposes, and has a special plot
method. The gendata function makes it easy to obtain a data frame containing
predictor combinations for obtaining selected predicted values.

The fastbw function performs a slightly inefficient but numerically stable
version of fast backward elimination on factors, using a method based on
Lawless and Singhal [385]. This method uses the fitted complete model and
computes approximate Wald statistics by computing conditional (restricted)
maximum likelihood estimates assuming multivariate normality of estimates.
It can be used in simulations since it returns indexes of factors retained and
dropped:

fit <- ols(y ~ x1*x2*x3)
# run, and print results:
fastbw(fit, optional.arguments)
# typically used in simulations:
z <- fastbw(fit, optional.args)
# least squares fit of reduced model:
lm.fit(X[,z$parms.kept], Y)

fastbw  deletes  factors,  not  columns  of  the  design  matrix.  Factors  requiring 
multiple  d.f.  will  be  retained  or  dropped  as  a  group.  The  function  prints  the 
deletion  statistics  for  each  variable  in  turn,  and  prints  approximate  parameter 
estimates  for  the  model  after  deleting  variables.  The  approximation  is  better 
when  the  number  of  factors  deleted  is  not  large.  For  ols,  the  approximation 
is  exact. 

The which.influence function creates a list with a component for each
factor in the model. The names of the components are the factor names.
Each component contains the observation identifiers of all observations that
are "overly influential" with respect to that factor, meaning that |dfbetas| > u
for at least one $\beta_i$ associated with that factor, for a given u. The
default u is .2. You must have specified x=TRUE, y=TRUE in the fitting function
to use which.influence. The first argument is the fit object, and the second
argument is the cutoff u.

The  following  R  program  will  print  the  set  of  predictor  values  that  were 
very  influential  for  each  factor.  It  assumes  that  the  data  frame  containing  the 
data  used  in  the  fit  is  called  df . 

f <- lrm(y ~ x1 + x2 + ..., data=df, x=TRUE, y=TRUE)
w <- which.influence(f, .4)
nam <- names(w)
for(i in 1:length(nam)) {
  cat("Influential observations for effect of",
      nam[i], "\n")
  print(df[w[[i]],])
}
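The |dfbetas| > u rule itself can be tried without rms. The following sketch
applies the same cutoff idea to an ordinary lm fit using base R's dfbetas; the
simulated data, the planted outlier, and the object names are assumptions made
for illustration, and the code is not the which.influence implementation.

```r
set.seed(3)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + rnorm(n)
d$x1[1] <- 2                    # plant one observation with an extreme
d$y[1]  <- 25                   # response at a high-leverage x1 value

fit <- lm(y ~ x1 + x2, data = d)
u   <- 0.4                                   # cutoff on |dfbetas|
db  <- dfbetas(fit)[, -1, drop = FALSE]      # drop the intercept column

# For each predictor, the observations whose |dfbetas| exceeds the cutoff
infl <- lapply(as.data.frame(abs(db) > u), which)
infl

# Predictor values for every flagged observation
d[sort(unique(unlist(infl))), ]
```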

The latex function is a generic function available in the Hmisc package. It
invokes a specific latex function for most of the fit objects created by rms to
create a LaTeX algebraic representation of the fitted model for inclusion in a
report or viewing on the screen. This representation documents all parameters
in the model and the functional form being assumed for Y, and is especially
useful for getting a simplified version of restricted cubic spline functions. On
the other hand, the print method with optional argument latex=TRUE is used
to output LaTeX code representing the model results in tabular form to the
console. This is intended for use with knitr or Sweave.

The Function function composes an R function that you can use to evaluate
$X\hat{\beta}$ analytically from a fitted regression model. The documentation
for Function also shows how to use a subsidiary function sascode that will
(almost) translate such an R function into SAS code for evaluating predicted
values in new subjects. Neither Function nor latex handles third-order
interactions.

The nomogram function draws a partial nomogram for obtaining predictions
from the fitted model manually. It constructs different scales when
interactions (up to third-order) are present. The constructed nomogram is not
complete, in that point scores are obtained for each predictor and the user must
add the point scores manually before reading predicted values on the final
axis of the nomogram. The constructed nomogram is useful for interpreting
the model fit, especially for non-monotonically transformed predictors (their
scales wrap around an axis automatically).

The  vif  function  computes  variance  inflation  factors  from  the  covariance 
matrix  of  a  fitted  model,  using  [147,654]. 
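One way to obtain variance inflation factors from a covariance matrix alone is
to convert the covariance matrix of the non-intercept estimates to a
correlation matrix and take the diagonal of its inverse. The sketch below does
this for a base R lm fit on assumed simulated data and compares the result with
the textbook 1/(1 - R^2) definition; it is offered as an illustration of the
idea and is not claimed to be the exact algorithm used by vif.

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)
y  <- x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

v <- vcov(fit)[-1, -1]                 # drop the intercept row and column
vif.cov <- diag(solve(cov2cor(v)))     # diagonal of the inverse correlation matrix

# Textbook definition for comparison: 1 / (1 - R^2) from regressing
# each predictor on the remaining predictors
X <- cbind(x1, x2, x3)
vif.r2 <- sapply(1:3, function(j)
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared))

round(cbind(vif.cov, vif.r2), 3)
```

For ordinary least squares the two columns agree; for other models only the
covariance-based version is directly available.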

The  impute  function  is  another  generic  function.  It  does  simple  imputation 
by  default.  It  can  also  work  with  the  transcan  function  to  multiply  or  singly 
impute  missing  values  using  a  flexible  additive  model. 
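Simple imputation amounts to filling each missing value with a single constant
while remembering which values were filled. A base R sketch of median
imputation follows; the cholesterol values are made up, and the code only
mimics the idea rather than reproducing the Hmisc impute function or its
return class.

```r
cholesterol <- c(180, 210, NA, 250, NA, 195)   # hypothetical values

was.missing <- is.na(cholesterol)              # remember what was imputed
cholesterol[was.missing] <- median(cholesterol, na.rm = TRUE)

cholesterol
which(was.missing)                             # positions that were filled in
```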

As an example of using many of the functions, suppose that a categorical
variable treat has values "a", "b", and "c", an ordinal variable num.diseases
has values 0, 1, 2, 3, 4, and that there are two continuous variables, age and
cholesterol. age is fitted with a restricted cubic spline, while cholesterol
is transformed using the transformation log(cholesterol+10). Cholesterol is
missing on three subjects, and we impute these using the overall median
cholesterol. We wish to allow for interaction between treat and cholesterol.
The following R program will fit a logistic model, test all effects in the design,
estimate effects, and plot estimated transformations. The fit for num.diseases
really considers the variable to be a five-level categorical variable. The only
difference is that a 3 d.f. test of linearity is done to assess whether the variable
can be remodeled "asis". Here we also show statements to attach the rms
package and store predictor characteristics from datadist.

require(rms)                  # make new functions available
ddist <- datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist <- datadist(data.frame.name)
options(datadist="ddist")     # defines data dist. to rms
cholesterol <- impute(cholesterol)
fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
               log(cholesterol+10) +
               treat:log(cholesterol+10))
describe(y ~ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in
# fit.  describe function (in Hmisc) gets simple statistics
# on variables
# fit <- robcov(fit)          # Would make all statistics that follow
                              # use a robust covariance matrix
                              # would need x=TRUE, y=TRUE in lrm()
specs(fit)                    # Describe the design characteristics
anova(fit)
anova(fit, treat, cholesterol)  # Test these 2 by themselves
plot(anova(fit))              # Summarize anova graphically
summary(fit)                  # Est. effects; default ranges
plot(summary(fit))            # Graphical display of effects with C.I.
# Specific reference cell and adjustment value:
summary(fit, treat="b", age=60)
# Estimate effect of increasing age: 50->70
summary(fit, age=c(50,70))
# Increase age 50->70, adjust to 60 when estimating
# effects of other factors:
summary(fit, age=c(50,60,70))
# If had not defined datadist, would have to define
# ranges for all variables

# Estimate and test treatment (b-a) effect averaged
# over 3 cholesterols:
contrast(fit, list(treat='b', cholesterol=c(150,200,250)),
              list(treat='a', cholesterol=c(150,200,250)),
         type='average')

p <- Predict(fit, age=seq(20,80,length=100), treat,
             conf.int=FALSE)
plot(p)                       # Plot relationship between age and
# or ggplot(p)                # log odds, separate curve for each
                              # treat, no C.I.
plot(p, ~ age | treat)        # Same but 2 panels
ggplot(p, groups=FALSE)
bplot(Predict(fit, age, cholesterol, np=50))
                              # 3-dimensional perspective plot for
                              # age, cholesterol, and log odds
                              # using default ranges for both
# Plot estimated probabilities instead of log odds:
plot(Predict(fit, num.diseases,
             fun=function(x) 1/(1+exp(-x)),
             conf.int=.9), ylab="Prob")
# Again, if no datadist were defined, would have to tell
# plot all limits
logit <- predict(fit, expand.grid(treat="b", num.diseases=1:3,
                                  age=c(20,40,60),
                                  cholesterol=seq(100,300,length=10)))
# Could obtain list of predictor settings interactively
logit <- predict(fit, gendata(fit, nobs=12))
# An easier approach is
# Predict(fit, treat='b', num.diseases=1:3, ...)

# Since age doesn't interact with anything, we can quickly
# and interactively try various transformations of age,
# taking the spline function of age as the gold standard.
# We are seeking a linearizing transformation.
ag <- 10:80
logit <- predict(fit, expand.grid(treat="a", num.diseases=0,
                                  age=ag,
                                  cholesterol=median(cholesterol)),
                 type="terms")[,"age"]
# Note: if age interacted with anything, this would be the
# age 'main effect' ignoring interaction terms
# Could also use logit <- Predict(f, age=ag, ...)$yhat,
# which allows evaluation of the shape for any level of
# interacting factors.  When age does not interact with
# anything, the result from predict(f, ..., type="terms")
# would equal the result from Predict if all other terms
# were ignored
# Could also specify:
# logit <- predict(fit,
#                  gendata(fit, age=ag, cholesterol=...))
# Unmentioned variables are set to reference values
plot(ag^.5, logit)            # try square root vs. spline transform.
plot(ag^1.5, logit)           # try 1.5 power

# Pretty printing of table of estimates and
# summary statistics:
print(fit, latex=TRUE)        # print LaTeX code to console
latex(fit)                    # invokes latex.lrm, creates fit.tex

# Draw a nomogram for the model fit
plot(nomogram(fit))

# Compose R function to evaluate linear predictors
# analytically
g <- Function(fit)
g(treat='b', cholesterol=260, age=50)
# Letting num.diseases default to reference value

To  examine  interactions  in  a  simpler  way,  you  may  want  to  group  age  into 
tertiles: 

age.tertile <- cut2(age, g=3)
# For auto ranges later, specify age.tertile to datadist
fit <- lrm(y ~ age.tertile * rcs(cholesterol))

Example  output  from  these  functions  is  shown  in  Chapter  10  and  later 
chapters. 

Note that type="terms" in predict scores each factor in a model with its
fitted transformation. This may be used to compute, for example, rank
correlation between the response and each transformed factor, pretending it has
1 d.f.

When regression is done on principal components, one may use an ordinary
linear model to decode "internal" regression coefficients for helping to
understand the final model. Here is an example.

require(rms)
dd <- datadist(my.data)
options(datadist='dd')
pcfit <- princomp(~ pain.symptom1 + pain.symptom2 + sign1 +
                    sign2 + sign3 + smoking)
pc2 <- pcfit$scores[,1:2]       # first 2 PCs as matrix
logistic.fit <- lrm(death ~ rcs(age,4) + pc2)
predicted.logit <- predict(logistic.fit)
linear.mod <- ols(predicted.logit ~ rcs(age,4) + pain.symptom1 +
                    pain.symptom2 + sign1 + sign2 + sign3 + smoking)
# This model will have R-squared=1
nom <- nomogram(linear.mod, fun=function(x) 1/(1+exp(-x)))

In addition to many of the add-on functions described above, there are
several other R functions that validate models. The first, predab.resample,
is a general-purpose function that is used by functions for specific models
described later. predab.resample computes estimates of optimism and
bias-corrected estimates of a vector of indexes of predictive accuracy, for a
model with a specified design matrix, with or without fast backward step-down of
predictors. If bw=TRUE, predab.resample prints a matrix of asterisks showing
which factors were selected at each repetition, along with a frequency
distribution of the number of factors retained across resamples. The function
has an optional parameter that may be specified to force the bootstrap
algorithm to do sampling with replacement from clusters rather than from
original records, which is useful when each subject has multiple records in
the dataset. It also has a parameter that can be used to validate predictions
in a subset of the records even though models are refit using all records.

The generic function validate invokes predab.resample with model-specific
fits and measures of accuracy. The function calibrate invokes predab.resample
to estimate bias-corrected model calibration and to plot the calibration curve.
Model calibration is estimated at a sequence of predicted values.
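The notions of optimism and bias correction can be illustrated with the usual
optimism bootstrap applied to $R^2$ from a base R lm fit. This is a sketch under
assumed simulated data with no variable selection; the helper rsq and all
object names are hypothetical, and the code is not the predab.resample
implementation.

```r
set.seed(6)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + rnorm(n)

# Hypothetical helper: R^2 of a fitted model evaluated on a given dataset
rsq <- function(fit, data) {
  p <- predict(fit, data)
  1 - sum((data$y - p)^2) / sum((data$y - mean(data$y))^2)
}

full     <- lm(y ~ x1 + x2, data = d)
apparent <- rsq(full, d)                  # accuracy on the data used to fit

B <- 100
optimism <- replicate(B, {
  b  <- d[sample(n, replace = TRUE), ]    # bootstrap sample
  fb <- lm(y ~ x1 + x2, data = b)         # refit in the bootstrap sample
  rsq(fb, b) - rsq(fb, d)                 # training minus original-sample accuracy
})

corrected <- apparent - mean(optimism)    # bias-corrected estimate
c(apparent = apparent, optimism = mean(optimism), corrected = corrected)
```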


6.4  Other  Functions 

For principal component analysis, R has the princomp and prcomp functions.
Canonical correlations and canonical variates can be easily computed using
the cancor function. There are many other R functions for examining
associations and for fitting models. The supsmu function implements
Friedman's "super smoother" [20]. The lowess function implements Cleveland's
two-dimensional smoother [111]. The glm function will fit generalized linear
models under a wide variety of distributions of Y. There are functions to fit
Hastie and Tibshirani's generalized additive model for a variety of
distributions. More is said about parametric and nonparametric additive multiple
regression functions in Chapter 16. The loess function fits a multidimensional
scatterplot smoother (the local regression model of Cleveland et al. [96]).
loess provides approximate test statistics for normal or symmetrically
distributed Y.


loess  has  a  large  number  of  options  allowing  various  restrictions  to  be  placed 
on  the  fitted  surface. 
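As a small illustration of loess and its approximate tests, the following
sketch fits one- and two-predictor surfaces to simulated data and compares them
with the anova method for loess objects in base R. The variable names and the
simulated relationship are assumptions made for this example.

```r
set.seed(5)
n <- 200
d <- data.frame(age = runif(n, 20, 70), pressure = runif(n, 80, 160))
d$y <- 0.05 * d$age + 0.02 * d$pressure + rnorm(n)

f1 <- loess(y ~ age,            data = d)   # smooth in age only
f2 <- loess(y ~ age + pressure, data = d)   # two-dimensional surface

a <- anova(f1, f2)                          # approximate F test
a

# Predicted values on a small grid inside the range of the data
grid <- expand.grid(age = c(30, 50), pressure = c(100, 140))
pred <- predict(f2, grid)
pred
```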

Atkinson and Therneau's rpart recursive partitioning package and related
functions implement classification and regression trees [69] algorithms for
binary, continuous, and right-censored response variables (assuming an
exponential distribution for the latter). rpart deals effectively with missing
predictor values using surrogate splits. The rms package has a validate function
for rpart objects for obtaining cross-validated mean squared errors and Somers'
$D_{xy}$ rank correlations (Brier score and ROC areas for probability models).

For displaying which variables tend to be missing on the same subjects,
the Hmisc naclus function can be used (e.g., plot(naclus(dataframename)) or
naplot(naclus(dataframename))). For characterizing what type of subjects
have NAs on a given predictor (or response) variable, a tree model whose
response variable is is.na(varname) can be quite useful.

require(rpart)
f <- rpart(is.na(cholesterol) ~ age + sex + trig + smoking)
plot(f)    # plots the tree
text(f)    # labels the tree

The Hmisc rcorr.cens function can compute Somers' $D_{xy}$ rank correlation
coefficient and its standard error, for binary or continuous (and possibly
right-censored) responses. A simple transformation of $D_{xy}$ yields the c index
(generalized ROC area). The Hmisc improveProb function is useful for comparing
two probability models using the methods of Pencina et al. in an
external validation setting. See also the rcorrp.cens function in this context.
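For a binary response the transformation is $c = (D_{xy} + 1)/2$, or equivalently
$D_{xy} = 2(c - 0.5)$. The following base R sketch computes both for four
made-up observations using the Wilcoxon rank-sum identity for the concordance
probability; it does not call rcorr.cens and does not handle censoring.

```r
# Hypothetical predicted risks and observed binary outcomes
pred <- c(0.10, 0.40, 0.35, 0.80)
y    <- c(0,    0,    1,    1)

n1 <- sum(y == 1)
n0 <- sum(y == 0)

# Concordance probability via the Wilcoxon rank-sum identity
cindex <- (mean(rank(pred)[y == 1]) - (n1 + 1) / 2) / n0
Dxy    <- 2 * (cindex - 0.5)

c(c = cindex, Dxy = Dxy)   # 0.75 and 0.50: 3 of the 4 (event, non-event) pairs are concordant
```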


6.5  Further  Reading 
Harrell and Goldstein [263] list components of statistical languages or packages
and compare several popular packages for survival analysis capabilities.

Imai et al. [319] have further generalized R as a statistical modeling language.


Chapter  7 

Modeling  Longitudinal  Responses  using 
Generalized  Least  Squares 


In this chapter we consider models for a multivariate response variable
represented by serial measurements over time within subject. This setup induces
correlations between measurements on the same subject that must be taken
into account to have optimal model fits and honest inference. Full likelihood
model-based approaches have advantages including (1) optimal handling of
imbalanced data and (2) robustness to missing data (dropouts) that occur
not completely at random. The three most popular model-based full
likelihood approaches are mixed effects models, generalized least squares, and
Bayesian hierarchical models. For continuous Y, generalized least squares
has a certain elegance, and a case study will demonstrate its use after
surveying competing approaches. As OLS is a special case of generalized least
squares, the case study is also helpful in developing and interpreting OLS
models^a.

a A case study in OLS (Chapter 7 from the first edition) may be found on the
text's web site.

Some good references on longitudinal data analysis
include [148, 159, 252, 414, 509, 635, 637].


7.1  Notation and Data Setup

Suppose there are $N$ independent subjects, with subject $i$ ($i = 1, 2, \ldots, N$)
having $n_i$ responses measured at times $t_{i1}, t_{i2}, \ldots, t_{in_i}$. The
response at time $t$ for subject $i$ is denoted by $Y_{it}$. Suppose that subject
$i$ has baseline covariates $X_i$. Generally the response measured at time
$t_{i1} = 0$ is a covariate in $X_i$ instead of being the first measured response
$Y_{i0}$.

For flexible analysis, longitudinal data are usually arranged in a "tall and
thin" layout. This allows measurement times to be irregular. In studies
comparing two or more treatments, a response is often measured at baseline
(pre-randomization). The analyst has the option to use this measurement as
$Y_{i0}$ or as part of $X_i$. There are many reasons to put initial measurements
of Y in X, i.e., to use baseline measurements as baseline.
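A "tall and thin" layout has one row per subject per measurement time. The
sketch below converts a small made-up wide data frame to that layout with base
R's reshape, keeping the baseline measurement y0 as a covariate rather than as
a response; all variable names and values are hypothetical.

```r
wide <- data.frame(id  = 1:3,
                   age = c(40, 55, 62),
                   y0  = c(5.1, 6.0, 4.8),   # baseline measurement, kept in X
                   y1  = c(5.0, 5.7, 4.9),
                   y2  = c(4.6, 5.5, NA))    # subject 3 missed the second visit

tall <- reshape(wide, direction = "long",
                varying = c("y1", "y2"), v.names = "y",
                timevar = "time", times = c(1, 2), idvar = "id")
tall <- tall[order(tall$id, tall$time), ]
tall <- tall[!is.na(tall$y), ]   # differing numbers of visits are fine here
tall
```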


7.2  Model  Specification  for  Effects  on  E(Y) 


Longitudinal  data  can  be  used  to  estimate  overall  means  or  the  mean  at  the 
last  scheduled  follow-up,  making  maximum  use  of  incomplete  records.  But  the 
real  value  of  longitudinal  data  comes  from  modeling  the  entire  time  course. 
Estimating  the  time  course  leads  to  understanding  slopes,  shapes,  overall 
trajectories,  and  periods  of  treatment  effectiveness.  With  continuous  Y  one 
typically  specifies  the  time  course  by  a  mean  time-response  profile.  Common 
representations  for  such  profiles  include 

•  k dummy variables for k + 1 unique times (assumes no functional form for
time but assumes discrete measurement times and may spend many d.f.)

•  k = 1 for linear time trend, $g_1(t) = t$

•  k-order polynomial in t

•  k + 1-knot restricted cubic spline (one linear term, k − 1 nonlinear terms)
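The four representations can be compared by counting the columns each one
contributes to a design matrix. The sketch below uses base R and the splines
package that ships with R; the measurement times and knot locations are made
up, and ns is used as a stand-in natural cubic spline basis (it spans the same
kind of restricted cubic spline space but uses a different basis than one
linear term plus k − 1 nonlinear terms).

```r
library(splines)                     # ships with R; provides ns()

tm <- c(0, 1, 2, 4, 8, 12)           # hypothetical measurement times
k  <- 3

X.dummy  <- model.matrix(~ factor(tm))[, -1]   # 5 dummies for 6 unique times
X.linear <- cbind(tm)                          # k = 1
X.poly   <- poly(tm, degree = k, raw = TRUE)   # tm, tm^2, tm^3
# Natural cubic spline with k + 1 = 4 knots in total:
# 2 boundary knots plus 2 interior knots give k = 3 columns
X.rcs    <- ns(tm, knots = c(2, 6), Boundary.knots = c(0, 12))

sapply(list(dummy = X.dummy, linear = X.linear,
            poly = X.poly, rcs = X.rcs), ncol)
```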

Suppose the time trend is modeled with k parameters so that the time
effect has k d.f. Let the basis functions modeling the time effect be
$g_1(t), g_2(t), \ldots, g_k(t)$ to allow it to be nonlinear. A model for the
time profile without interactions between time and any X is given by

$$X_i\beta + \gamma_1 g_1(t) + \gamma_2 g_2(t) + \cdots + \gamma_k g_k(t). \qquad (7.1)$$



To allow the slope or shape of the time-response profile to depend on some
of the Xs we add product terms for desired interaction effects. For example,
to allow the mean time trend for subjects in group 1 (reference group) to
be arbitrarily different from the time trend for subjects in group 2, have a
dummy variable for group 2, a time "main effect" curve with k d.f. and all k
products of these time components with the dummy variable for group 2.
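The column count for such a model can be verified directly. The sketch below
builds the design matrix for a two-group study with a k = 3 d.f. time curve
using base R and the splines package; the times, knots, and group coding are
assumptions made for illustration, and ns stands in for a restricted cubic
spline basis.

```r
library(splines)

d <- expand.grid(time = c(0, 1, 2, 4, 8, 12), group = factor(1:2))

# Group-2 dummy, time main effect with k = 3 d.f., and all k products
X <- model.matrix(~ group * ns(time, knots = c(2, 6),
                               Boundary.knots = c(0, 12)),
                  data = d)

colnames(X)
ncol(X)    # intercept + 1 group dummy + 3 time terms + 3 product terms
```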

Once  the  right  hand  side  of  the  model  is  formulated,  predicted  values, 
contrasts,  and  ANOVAs  are  obtained  just  as  with  a  univariate  model.  For 
these  purposes  time  is  no  different  than  any  other  covariate  except  for  what 
is  described  in  the  next  section. 


7.3  Modeling  Within-Subject  Dependence 

Sometimes understanding within-subject correlation patterns is of interest
in itself. More commonly, accounting for intra-subject correlation is crucial
for inferences to be valid. Some methods of analysis cover up the correlation
pattern while others assume a restrictive form for the pattern. The following
table is an attempt to briefly survey available longitudinal analysis
methods. LOCF and the summary statistic method are not modeling methods.
LOCF is an ad hoc attempt to account for longitudinal dropouts, and
summary statistics can convert multivariate responses to univariate ones with few
assumptions (other than minimal dropouts), with some information loss.


What Methods To Use for Repeated Measurements / Serial Data?^a,b

Repeated Measures ANOVA | GEE | Mixed Effects Model | GLS | LOCF | Summary Statistic^c

Assumes  normality 

X 

X 

X 

Assumes  independence  of 

X  d 

xe 

measurements  within  subject 
Assumes  a  correlation  structuref 

X 

x§ 

X 

X 

Requires  same  measurement 

X 

? 

times  for  all  subjects 

Does  not  allow  smooth  modeling 

X 

of  time  to  save  d.f. 

Does  not  allow  adjustment  for 

X 

baseline  covariates 

Does  not  easily  extend  to 

X 

X 

non-continuous  Y 

Loses  information  by  not  using 

xh 

X 

intermediate  measurements 

Does  not  allow  widely  varying  # 

X 

X 1 

X 

xj 

of  observations  per  subject 

Does  not  allow  for  subjects 
to  have  distinct  trajectoriesk 
Assumes  subject-specific  effects 

X 

X 

X 

X 

X 

are  Gaussian 

Badly  biased  if  non-random 

? 

X 

X 

dropouts 

Biased  in  general 

X 

Harder  to  get  tests  &  CLs 

Requires  large  #  subjects/clusters 

X 

X1 

xm 

SEs  are  wrong 

xn 

X 

Assumptions  are  not  verifiable 

X 

N/A 

X 

X 

X 

in  small  samples 

Does  not  extend  to  complex 

X 

X 

X 

X 

? 

settings  such  as  time-dependent 
covariates  and  dynamic0  models 

a  Thanks  to  Charles  Berry,  Brian  Cade,  Peter  Flom,  Bert  Gunter,  and  Leena  Choi 
for  valuable  input. 

b  GEE:  generalized  estimating  equations;  GLS:  generalized  least  squares;  LOCF:  last 
observation  carried  forward. 

c  E.g.,  compute  within-subject  slope,  mean,  or  area  under  the  curve  over  time.  As¬ 
sumes  that  the  summary  measure  is  an  adequate  summary  of  the  time  profile  and 
assesses  the  relevant  treatment  effect. 
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The  most  prevalent  full  modeling  approach  is  mixed  effects  models  in  which 
baseline  predictors  are  fixed  effects,  and  random  effects  are  used  to  describe 
subject  differences  and  to  induce  within-subject  correlation.  Some  disadvan¬ 
tages  of  mixed  effects  models  are 


•  The  induced  correlation  structure  for  Y  may  be  unrealistic  if  care  is  not 
taken  in  specifying  the  model. 

•  Random  effects  require  complex  approximations  for  distributions  of  test 
statistics. 

•  The  most  commonly  used  models  assume  that  random  effects  follow  a 
normal  distribution.  This  assumption  may  not  hold. 


It  could  be  argued  that  an  extended  linear  model  (with  no  random  effects) 
is  a  logical  extension  of  the  univariate  OLS  model  b.  This  model,  called  the 
generalized  least  squares  or  growth  curve  model  21, 509, 51°,  was  developed  long 
before  mixed  effect  models  became  popular. 

We  will  assume  that  Yu\Xi  has  a  multivariate  normal  distribution  with 
mean  given  above  and  with  variance-covariance  matrix  T^,  an  rii  x  rii  matrix 
that  is  a  function  of  tn, . . . ,  tiUi .  We  further  assume  that  the  diagonals  of  Vi 
are  all  equal5.  This  extended  linear  model  has  the  following  assumptions: 


all  the  assumptions  of  OLS  at  a  single  time  point  including  correct  mod¬ 
eling  of  predictor  effects  and  univariate  normality  of  responses  conditional 
on  X 


d  Unless  one  uses  the  Huynh-Feldt  or  Greenhouse-Geisser  correction 
e  For  full  efficiency,  if  using  the  working  independence  model 
f  Or  requires  the  user  to  specify  one 
g  For  full  efficiency  of  regression  coefficient  estimates 
h  Unless  the  last  observation  is  missing 

1  The  cluster  sandwich  variance  estimator  used  to  estimate  SEs  in  GEE  does  not 
perform  well  in  this  situation,  and  neither  does  the  working  independence  model 
because  it  does  not  weight  subjects  properly. 

j  Unless  one  knows  how  to  properly  do  a  weighted  analysis 
k  Or  uses  population  averages 

1  Unlike  GLS,  does  not  use  standard  maximum  likelihood  methods  yielding  simple 
likelihood  ratio  y2  statistics.  Requires  high-dimensional  integration  to  marginalize 
random  effects,  using  complex  approximations,  and  if  using  SAS,  unintuitive  d.f.  for 
the  various  tests. 

m  Because  there  is  no  correct  formula  for  SE  of  effects;  ordinary  SEs  are  not  penalized 
for  imputation  and  are  too  small 

n  If  correction  not  applied 

°  E.g.,  a  model  with  a  predictor  that  is  a  lagged  value  of  the  response  variable 

b  E.g.,  few  statisticians  use  subject  random  effects  for  univariate  Y.  Pinheiro  and 
Bates  [509,  Section  5.1.2]  state  that  “in  some  applications,  one  may  wish  to  avoid 
incorporating  random  effects  in  the  model  to  account  for  dependence  among  obser¬ 
vations,  choosing  to  use  the  within-group  component  Ai  to  directly  model  variance- 
covariance  structure  of  the  response.” 

b  This  procedure  can  be  generalized  to  allow  for  heteroscedasticity  over  time  or  with 
respect  to  A,  e.g.,  males  may  be  allowed  to  have  a  different  variance  than  females. 
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•  the  distribution  of  two  responses  at  two  different  times  for  the  same  sub¬ 
ject,  conditional  on  X,  is  bivariate  normal  with  a  specified  correlation 
coefficient 

•  the  joint  distribution  of  all  rq  responses  for  the  ith  subject  is  multivariate 
normal  with  the  given  correlation  pattern  (which  implies  the  previous  two 
distributional  assumptions) 

•  responses  from  two  different  subjects  are  uncorrelated. 


7.4  Parameter  Estimation  Procedure 

Generalized least squares is like weighted least squares but uses a covariance
matrix that is not diagonal. Each subject can have her own shape of $V_i$ due
to each subject being measured at a different set of times. This is a maximum
likelihood procedure. Newton-Raphson or other trial-and-error methods are
used for estimating parameters. For a small number of subjects, there are
advantages in using REML (restricted maximum likelihood) instead of ordinary
MLE [159, Section 5.3], [509, Chapter 5], [221] (especially to get a more unbiased
estimate of the covariance matrix).
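
The idea of a non-diagonal weight matrix can be illustrated with a short base R sketch. It computes the GLS point estimate when the covariance matrix is treated as known; the data, the correlation of 0.8, and the helper names `gls_known_V` and `car1` are illustrative assumptions, not part of nlme or rms. In real analyses the correlation parameters are estimated together with the regression coefficients by ML or REML.

```r
# Minimal sketch: GLS point estimate when the covariance matrix V is known.
#   beta_hat = (X' V^-1 X)^-1 X' V^-1 y
gls_known_V <- function(X, y, V) {
  Vinv <- solve(V)
  XtVi <- t(X) %*% Vinv
  bvar <- solve(XtVi %*% X)   # covariance of beta_hat when V is the true covariance
  list(coef = drop(bvar %*% XtVi %*% y), var = bvar)
}

# Continuous-time AR(1) correlation matrix for one subject's measurement times
car1 <- function(times, rho) rho ^ abs(outer(times, times, "-"))

# Two hypothetical subjects measured at different sets of weeks
t1   <- c(2, 4, 8)
t2   <- c(2, 16)
week <- c(t1, t2)
y    <- c(40, 38, 41, 52, 47)
X    <- cbind(Intercept = 1, week = week)

# Block-diagonal V: each subject has her own V_i; subjects are uncorrelated
V <- matrix(0, 5, 5)
V[1:3, 1:3] <- car1(t1, rho = 0.8)
V[4:5, 4:5] <- car1(t2, rho = 0.8)

fit_gls <- gls_known_V(X, y, V)
fit_ols <- gls_known_V(X, y, diag(5))   # identity V reproduces the OLS estimates
rbind(GLS = fit_gls$coef, OLS = fit_ols$coef)
```

Passing the identity matrix for V gives back ordinary least squares, which is one way to see GLS as a generalization of OLS rather than a different method.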

When imbalances of measurement times are not severe, OLS fitted ignoring
subject identifiers may be efficient for estimating $\beta$. But OLS standard errors
will be too small as they don’t take intra-cluster correlation into account.
This may be rectified by substituting a covariance matrix estimated using
the Huber-White cluster sandwich estimator or from the cluster bootstrap.
When imbalances are severe and intra-subject correlations are strong, OLS
(or GEE using a working independence model) is not expected to be efficient
because it gives equal weight to each observation; a subject contributing two
distant observations receives 1/5 the weight of a subject having 10
tightly-spaced observations.
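
As a rough sketch of the cluster sandwich idea mentioned above, the following base R code forms the Huber-White estimator by summing score contributions within subjects. The function name `cluster_sandwich`, the simulated data, and the absence of any small-sample correction are assumptions of this illustration; production code would normally rely on an established implementation.

```r
# Sketch of the Huber-White cluster sandwich covariance estimator for OLS:
#   (X'X)^-1 [ sum_i X_i' e_i e_i' X_i ] (X'X)^-1
# No small-sample correction is applied (an assumption of this sketch).
cluster_sandwich <- function(X, y, id) {
  XtXinv <- solve(crossprod(X))
  beta   <- XtXinv %*% crossprod(X, y)
  e      <- drop(y - X %*% beta)
  scores <- rowsum(X * e, group = id)   # X_i' e_i, one row per cluster
  meat   <- crossprod(scores)           # sum over clusters of outer products
  list(coef = drop(beta), var = XtXinv %*% meat %*% XtXinv)
}

# Hypothetical longitudinal data: 30 subjects, 5 visits, random intercepts
set.seed(1)
N <- 30; n <- 5
id   <- rep(1:N, each = n)
week <- rep(c(2, 4, 8, 12, 16), N)
y    <- 40 + 0.3 * week + rep(rnorm(N, sd = 8), each = n) + rnorm(N * n, sd = 4)
X    <- cbind(Intercept = 1, week = week)

fit <- cluster_sandwich(X, y, id)
rbind(naive   = sqrt(diag(vcov(lm(y ~ week)))),   # ignores within-subject correlation
      cluster = sqrt(diag(fit$var)))
```

The point estimates are the ordinary OLS ones; only the covariance matrix changes, and the two sets of standard errors can differ noticeably when within-subject correlation is strong.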


7.5  Common  Correlation  Structures 

We usually restrict ourselves to isotropic correlation structures which assume
the correlation between responses within subject at two times depends only on
a measure of the distance between the two times, not the individual times.
We simplify further and assume it depends on $|t_1 - t_2|$^c. Assume that the
correlation coefficient for $Y_{it_1}$ vs. $Y_{it_2}$ conditional on baseline covariates $X_i$
for subject $i$ is $h(|t_1 - t_2|, \rho)$, where $\rho$ is a vector (usually a scalar) set of
fundamental correlation parameters. Some commonly used structures when
times are continuous and are not equally spaced [509, Section 5.3.3] are shown
below, along with the correlation function names from the R nlme package.

c We can speak interchangeably of correlations of residuals within subjects or
correlations between responses measured at different times on the same subject,
conditional on covariates X.


1. Compound symmetry: $h = \rho$ if $t_1 \neq t_2$, 1 if $t_1 = t_2$   (nlme: corCompSymm)
   (Essentially what two-way ANOVA assumes)
2. Autoregressive-moving average lag 1: $h = \rho^{|t_1 - t_2|} = \rho^s$,
   where $s = |t_1 - t_2|$   (nlme: corCAR1)
3. Exponential: $h = \exp(-s/\rho)$   (nlme: corExp)
4. Gaussian: $h = \exp[-(s/\rho)^2]$   (nlme: corGaus)
5. Linear: $h = (1 - s/\rho)[s < \rho]$   (nlme: corLin)
6. Rational quadratic: $h = 1 - (s/\rho)^2 / [1 + (s/\rho)^2]$   (nlme: corRatio)
7. Spherical: $h = [1 - 1.5(s/\rho) + 0.5(s/\rho)^3][s < \rho]$   (nlme: corSpher)
8. Linear exponent AR(1): $h = \rho^{d_{min} + \delta \frac{s - d_{min}}{d_{max} - d_{min}}}$, 1 if $t_1 = t_2$ [572]

The structures 3-7 use $\rho$ as a scaling parameter, not as something
restricted to be in [0, 1].
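
To make the formulas concrete, the correlation functions can be written directly in base R. This is only a sketch for exploring the shapes: the `h_*` functions and `cor_matrix` are hypothetical helper names mirroring the formulas above, not the nlme constructors themselves.

```r
# s = |t1 - t2| >= 0; rho is a correlation for the first two structures and a
# scaling (range) parameter for the others.
h_compsymm <- function(s, rho) ifelse(s == 0, 1, rho)
h_car1     <- function(s, rho) rho ^ s
h_exp      <- function(s, rho) exp(-s / rho)
h_gaus     <- function(s, rho) exp(-(s / rho) ^ 2)
h_lin      <- function(s, rho) (1 - s / rho) * (s < rho)
h_ratio    <- function(s, rho) 1 - (s / rho) ^ 2 / (1 + (s / rho) ^ 2)
h_spher    <- function(s, rho) (1 - 1.5 * (s / rho) + 0.5 * (s / rho) ^ 3) * (s < rho)

# Within-subject correlation matrix for one subject's measurement times
cor_matrix <- function(times, h, rho) h(abs(outer(times, times, "-")), rho)

times <- c(2, 4, 8, 12, 16)           # unequally spaced weeks
round(cor_matrix(times, h_car1, 0.867), 3)
round(cor_matrix(times, h_lin, 10), 3)
```

One consequence visible from the formulas is that the exponential structure with range $\rho$ is the continuous-time AR(1) structure reparameterized with correlation $\exp(-1/\rho)$, so the two describe the same family of correlation patterns.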


7.6  Checking  Model  Fit 

The constant variance assumption may be checked using typical residual
plots. The univariate normality assumption (but not multivariate normality)
may be checked using typical Q-Q plots on residuals. For checking the
correlation pattern, a variogram is a very helpful device based on estimating
correlations of all possible pairs of residuals at different time points^d. Pairs
of estimates obtained at the same absolute time difference s are pooled. The
variogram is a plot with $y = 1 - h(s, \rho)$ vs. s on the x-axis, and the theoretical
variogram of the correlation model currently being assumed is superimposed.
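
The pooling step can be sketched in base R as follows. This is a simplified illustration, assuming residuals are standardized by a single overall standard deviation; the function name `empirical_variogram` and the simulated residuals are hypothetical, and the result is not claimed to match the output of nlme's Variogram function.

```r
# Sketch: empirical semivariogram of residuals, pooled by absolute time gap s.
# For standardized residuals with correlation h(s), E[(r_j - r_k)^2 / 2] = 1 - h(s).
empirical_variogram <- function(resid, time, id) {
  r <- resid / sd(resid)                      # crude standardization (assumption)
  pairs <- do.call(rbind, lapply(split(seq_along(r), id), function(idx) {
    if (length(idx) < 2) return(NULL)         # a single observation forms no pairs
    cmb <- combn(idx, 2)                      # all within-subject pairs
    data.frame(s = abs(time[cmb[1, ]] - time[cmb[2, ]]),
               v = (r[cmb[1, ]] - r[cmb[2, ]]) ^ 2 / 2)
  }))
  out <- aggregate(v ~ s, data = pairs, FUN = mean)   # pool pairs with equal s
  out$n.pairs <- as.vector(table(pairs$s))
  out
}

# Simulated residuals with continuous-time AR(1) correlation (rho = 0.85)
set.seed(2)
N <- 200; times <- c(2, 4, 8, 12, 16); rho <- 0.85
R <- rho ^ abs(outer(times, times, "-"))
L <- t(chol(R))                               # R = L %*% t(L)
sim <- do.call(rbind, lapply(1:N, function(i)
  data.frame(id = i, week = times, resid = drop(L %*% rnorm(length(times))))))

vg <- empirical_variogram(sim$resid, sim$week, sim$id)
vg$theoretical <- 1 - rho ^ vg$s              # variogram implied by the AR(1) model
vg
```

The `n.pairs` column shows how unevenly the time gaps are represented, which is one reason empirical variograms can be noisy at gaps supported by few pairs.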


7.7  Sample  Size  Considerations 

Section 4.4 provided some guidance about sample sizes needed for OLS.
A good way to think about sample size adequacy for generalized least squares
is to determine the effective number of independent observations that a given
configuration of repeated measurements has. For example, if the standard
error of an estimate from three measurements on each of 20 subjects is the same
as the standard error from 27 subjects measured once, we say that the 20 × 3
study has an effective sample size of 27, and we equate power from the
univariate analysis on n subjects measured once to 20n/27 subjects measured three
times. Faes et al. have a nice approach to effective sample sizes with a
variety of correlation patterns in longitudinal data. For an AR(1) correlation
structure with n equally spaced measurement times on each of N subjects,
with the correlation between two consecutive times being $\rho$, the effective
sample size is

$$\frac{n - (n - 2)\rho}{1 + \rho} N.$$

Under compound symmetry, the effective size is

$$\frac{nN}{1 + \rho(n - 1)}.$$

d Variograms can be unstable.
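
The two closed forms can be checked numerically. The sketch below assumes the effective sample size refers to the information about a mean under GLS weighting, which for one subject is the sum of the elements of the inverse correlation matrix; the function names and the values n = 3, N = 20, $\rho$ = 0.6 are illustrative choices.

```r
# Effective sample size as N * (1' R^-1 1), where R is the n x n within-subject
# correlation matrix (an assumed interpretation for this sketch).
ess_numeric <- function(R, N) N * sum(solve(R))

# Closed forms for equally spaced times
ess_ar1 <- function(n, N, rho) N * (n - (n - 2) * rho) / (1 + rho)
ess_cs  <- function(n, N, rho) n * N / (1 + rho * (n - 1))

n <- 3; N <- 20; rho <- 0.6
R_ar1 <- rho ^ abs(outer(1:n, 1:n, "-"))      # AR(1) correlation matrix
R_cs  <- matrix(rho, n, n); diag(R_cs) <- 1   # compound symmetry

c(ar1 = ess_ar1(n, N, rho), numeric = ess_numeric(R_ar1, N))
c(cs  = ess_cs(n, N, rho),  numeric = ess_numeric(R_cs, N))
```

With these inputs the AR(1) design is worth 30 independent observations and the compound symmetry design about 27.3, close to the 20 × 3 example in the text. Both formulas reduce to nN when $\rho$ = 0 and to N when $\rho$ = 1.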


7.8  R  Software 

The nonlinear mixed effects model package nlme of Pinheiro & Bates in
R provides many useful functions. For fitting linear models, fitting functions
are lme for mixed effects models and gls for generalized least squares without
random effects. The rms package has a front-end function Gls so that many
features of rms can be used:

anova: all partial Wald tests, test of linearity, pooled tests

summary: effect estimates (differences in Y) and confidence limits

Predict and plot: partial effect plots

nomogram: nomogram

Function: generate R function code for the fitted model

latex: LaTeX representation of the fitted model.

In addition, Gls has a cluster bootstrap option (hence you do not use rms’s
bootcov for Gls fits). When B is provided to Gls( ), bootstrapped regression
coefficients and correlation estimates are saved, the former setting up for
bootstrap percentile confidence limits^e. The nlme package has many graphics
and fit-checking functions. Several functions will be demonstrated in the case
study.


7.9  Case  Study 

Consider the dataset in Table 6.9 of Davis [148, pp. 161-163] from a
multicenter, randomized controlled trial of botulinum toxin type B (BotB) in
patients with cervical dystonia from nine U.S. sites. Patients were randomized
to placebo (N = 36), 5000 units of BotB (N = 36), or 10,000 units of BotB
(N = 37). The response variable is the total score on the Toronto Western
Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and
disability of cervical dystonia (high scores mean more impairment). TWSTRS
is measured at baseline (week 0) and weeks 2, 4, 8, 12, 16 after treatment
began. The dataset name on the dataset wiki page is cdystonia.


e To access regular gls functions named anova (for likelihood ratio tests, AIC, etc.)
or summary use anova.gls or summary.gls.
7.9.1 Graphical Exploration of Data

Graphics which follow display raw data as well as quartiles of TWSTRS by
time, site, and treatment. A table shows the realized measurement schedule.

require(rms)

getHdata(cdystonia)
attach(cdystonia)

# Construct unique subject ID
uid <- with(cdystonia, factor(paste(site, id)))

# Tabulate patterns of subjects' time points
table(tapply(week, uid,
             function(w) paste(sort(unique(w)), collapse=' ')))

# Plot raw data, superposing subjects
xl <- xlab('Week'); yl <- ylab('TWSTRS-total score')
ggplot(cdystonia, aes(x=week, y=twstrs, color=factor(id))) +
  geom_line() + xl + yl + facet_grid(treat ~ site) +
  guides(color=FALSE)   # Fig. 7.1

# Show quartiles
ggplot(cdystonia, aes(x=week, y=twstrs)) + xl + yl +
  ylim(0, 70) + stat_summary(fun.data="median_hilow",
                             conf.int=0.5, geom='smooth') +
  facet_wrap(~ treat, nrow=2)   # Fig. 7.2

Next the data are rearranged so that $Y_{i0}$ is a baseline covariate.

baseline <- subset(data.frame(cdystonia, uid), week == 0,
                   -week)
baseline <- upData(baseline, rename=c(twstrs='twstrs0'),
                   print=FALSE)
followup <- subset(data.frame(cdystonia, uid), week > 0,
                   c(uid, week, twstrs))
rm(uid)
both     <- merge(baseline, followup, by='uid')

dd       <- datadist(both)
options(datadist='dd')
Fig. 7.1 Time profiles for individual subjects, stratified by study site and dose


7.9.2  Using  Generalized  Least  Squares 


We  stay  with  baseline  adjustment  and  use  a  variety  of  correlation  structures, 
with  constant  variance.  Time  is  modeled  as  a  restricted  cubic  spline  with 
3  knots,  because  there  are  only  3  unique  interior  values  of  week.  Below,  six 
correlation  patterns  are  attempted.  In  general  it  is  better  to  use  scientific 
knowledge  to  guide  the  choice  of  the  correlation  structure. 

require(nlme)

cp <- list(corCAR1, corExp, corCompSymm, corLin, corGaus, corSpher)
z  <- vector('list', length(cp))
for(k in 1:length(cp)) {
  z[[k]] <- gls(twstrs ~ treat * rcs(week, 3) +
                rcs(twstrs0, 3) + rcs(age, 4) * sex, data=both,
                correlation=cp[[k]](form = ~week | uid))
}

anova(z[[1]], z[[2]], z[[3]], z[[4]], z[[5]], z[[6]])

       Model df      AIC      BIC    logLik
z[[1]]     1 20 3553.906 3638.357 -1756.953
z[[2]]     2 20 3553.906 3638.357 -1756.953
z[[3]]     3 20 3587.974 3672.426 -1773.987
z[[4]]     4 20 3575.079 3659.531 -1767.540
z[[5]]     5 20 3621.081 3705.532 -1790.540
z[[6]]     6 20 3570.958 3655.409 -1765.479
Fig. 7.2 Quartiles of TWSTRS stratified by dose


AIC computed above is set up so that smaller values are best. From this
the continuous-time AR1 and exponential structures are tied for the best.
For the remainder of the analysis we use corCAR1, using Gls.

a <- Gls(twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +
         rcs(age, 4) * sex, data=both,
         correlation=corCAR1(form=~week | uid))

print(a, latex=TRUE)

Generalized Least Squares Fit by REML

Gls(model = twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +
    rcs(age, 4) * sex, data = both, correlation = corCAR1
    (form = ~week | uid))

Obs                             522
Clusters                        108
g                            11.334
Log-restricted-likelihood  -1756.95
Model d.f.                       17
σ                            8.5917
d.f.                            504
                           Coef     S.E.      t  Pr(>|t|)
Intercept               -0.3093  11.8804  -0.03    0.9792
treat=5000U              0.4344   2.5962   0.17    0.8672
treat=Placebo            7.1433   2.6133   2.73    0.0065
week                     0.2879   0.2973   0.97    0.3334
week'                    0.7313   0.3078   2.38    0.0179
twstrs0                  0.8071   0.1449   5.57  < 0.0001
twstrs0'                 0.2129   0.1795   1.19    0.2360
age                     -0.1178   0.2346  -0.50    0.6158
age'                     0.6968   0.6484   1.07    0.2830
age''                   -3.4018   2.5599  -1.33    0.1845
sex=M                   24.2802  18.6208   1.30    0.1929
treat=5000U * week       0.0745   0.4221   0.18    0.8599
treat=Placebo * week    -0.1256   0.4243  -0.30    0.7674
treat=5000U * week'     -0.4389   0.4363  -1.01    0.3149
treat=Placebo * week'   -0.6459   0.4381  -1.47    0.1411
age * sex=M             -0.5846   0.4447  -1.31    0.1892
age' * sex=M             1.4652   1.2388   1.18    0.2375
age'' * sex=M           -4.0338   4.8123  -0.84    0.4023

Correlation Structure: Continuous AR(1)
Formula: ~week | uid
Parameter estimate(s):

      Phi
0.8666689

$\hat{\rho} = 0.867$, the estimate of the correlation between two measurements
taken one week apart on the same subject. The estimated correlation for
measurements 10 weeks apart is $0.867^{10} = 0.24$.
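
The same arithmetic extends to any time gap. A small sketch, using the Phi estimate printed above and an illustrative set of gaps chosen for this example:

```r
# Implied within-subject correlations under continuous-time AR(1): rho^s
phi  <- 0.8666689                 # estimated Phi from the corCAR1 fit
gaps <- c(1, 2, 4, 10, 14)        # weeks apart (14 = week 2 vs. week 16)
cors <- setNames(phi ^ gaps, paste0(gaps, "wk"))
round(cors, 2)
```

Under this structure the correlation decays geometrically, so measurements at the two ends of follow-up are only weakly correlated even though adjacent weeks are strongly correlated.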


v <- Variogram(a, form=~ week | uid)
plot(v)   # Figure 7.3

The  empirical  variogram  is  largely  in  agreement  with  the  pattern  dictated  by 
AR(1). 

Next  check  constant  variance  and  normality  assumptions. 


both$resid <- r <- resid(a); both$fitted <- fitted(a)
yl <- ylab('Residuals')
p1 <- ggplot(both, aes(x=fitted, y=resid)) + geom_point() +
      facet_grid(~ treat) + yl
p2 <- ggplot(both, aes(x=twstrs0, y=resid)) + geom_point() + yl
p3 <- ggplot(both, aes(x=week, y=resid)) + yl + ylim(-20, 20) +
      stat_summary(fun.data="mean_sdl", geom='smooth')
p4 <- ggplot(both, aes(sample=resid)) + stat_qq() +
      geom_abline(intercept=mean(r), slope=sd(r)) + yl
gridExtra::grid.arrange(p1, p2, p3, p4, ncol=2)   # Figure 7.4

Fig. 7.3 Variogram, with assumed correlation pattern superimposed


These  model  assumptions  appear  to  be  well  satisfied,  so  inferences  are  likely 
to  be  trustworthy  if  the  more  subtle  multivariate  assumptions  hold. 

Now  get  hypothesis  tests,  estimates,  and  graphically  interpret  the  model. 


plot(anova(a))   # Figure 7.5

ylm <- ylim(25, 60)
p1 <- ggplot(Predict(a, week, treat, conf.int=FALSE),
             adj.subtitle=FALSE, legend.position='top') + ylm
p2 <- ggplot(Predict(a, twstrs0), adj.subtitle=FALSE) + ylm
p3 <- ggplot(Predict(a, age, sex), adj.subtitle=FALSE,
             legend.position='top') + ylm
gridExtra::grid.arrange(p1, p2, p3, ncol=2)   # Figure 7.6

latex(summary(a), file='', table.env=FALSE)   # Shows for week 8

                          Low  High   Δ    Effect     S.E.  Lower 0.95  Upper 0.95
week                        4    12   8   6.69100  1.10570      4.5238      8.8582
twstrs0                    39    53  14  13.55100  0.88618     11.8140     15.2880
age                        46    65  19   2.50270  2.05140     -1.5179      6.5234
treat - 5000U:10000U        1     2       0.59167  1.99830     -3.3249      4.5083
treat - Placebo:10000U      1     3       5.49300  2.00430      1.5647      9.4212
sex - M:F                   1     2      -1.08500  1.77860     -4.5711      2.4011

# To get results for week 8 for a different reference group
# for treatment, use e.g. summary(a, week=4, treat='Placebo')

# Compare low dose with placebo, separately at each time
Fig. 7.4 Three residual plots to check for absence of trends in central tendency
and in variability. Upper right panel shows the baseline score on the x-axis. Bottom
left panel shows the mean ±2×SD. Bottom right panel is the QQ plot for checking
normality of residuals from the GLS fit.

Fig. 7.5 Results of anova from generalized least squares fit with continuous time
AR1 correlation structure. As expected, the baseline version of Y dominates.

Fig. 7.6 Estimated effects of time, baseline TWSTRS, age, and sex


k1 <- contrast(a, list(week=c(2,4,8,12,16), treat='5000U'),
                  list(week=c(2,4,8,12,16), treat='Placebo'))
options(width=80)
print(k1, digits=3)

   week twstrs0 age sex Contrast S.E.  Lower  Upper     Z Pr(>|z|)
1     2      46  56   F    -6.31 2.10 -10.43 -2.186 -3.00   0.0027
2     4      46  56   F    -5.91 1.82  -9.47 -2.349 -3.25   0.0011
3     8      46  56   F    -4.90 2.01  -8.85 -0.953 -2.43   0.0150
4*   12      46  56   F    -3.07 1.75  -6.49  0.361 -1.75   0.0795
5*   16      46  56   F    -1.02 2.10  -5.14  3.092 -0.49   0.6260

Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals


# Compare high dose with placebo
k2 <- contrast(a, list(week=c(2,4,8,12,16), treat='10000U'),
                  list(week=c(2,4,8,12,16), treat='Placebo'))
print(k2, digits=3)

   week twstrs0 age sex Contrast S.E.  Lower Upper     Z Pr(>|z|)
1     2      46  56   F    -6.89 2.07 -10.96 -2.83 -3.32   0.0009
2     4      46  56   F    -6.64 1.79 -10.15 -3.13 -3.70   0.0002
3     8      46  56   F    -5.49 2.00  -9.42 -1.56 -2.74   0.0061
4*   12      46  56   F    -1.76 1.74  -5.17  1.65 -1.01   0.3109
5*   16      46  56   F     2.62 2.09  -1.47  6.71  1.25   0.2099

Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals
k1 <- as.data.frame(k1[c('week', 'Contrast', 'Lower', 'Upper')])
p1 <- ggplot(k1, aes(x=week, y=Contrast)) + geom_point() +
      geom_line() + ylab('Low Dose - Placebo') +
      geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
k2 <- as.data.frame(k2[c('week', 'Contrast', 'Lower', 'Upper')])
p2 <- ggplot(k2, aes(x=week, y=Contrast)) + geom_point() +
      geom_line() + ylab('High Dose - Placebo') +
      geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
gridExtra::grid.arrange(p1, p2, ncol=2)   # Figure 7.7

Fig.  7.7  Contrasts  and  0.95  confidence  limits  from  GLS  fit 


Although multiple d.f. tests such as total treatment effects or treatment
× time interaction tests are comprehensive, their increased degrees of
freedom can dilute power. In a treatment comparison, treatment contrasts at
the last time point (single d.f. tests) are often of major interest. Such
contrasts are informed by all the measurements made by all subjects (up until
dropout times) when a smooth time trend is assumed. They use appropriate
extrapolation past dropout times based on observed trajectories of subjects
followed the entire observation period. In agreement with the top left panel
of Figure 7.6, Figure 7.7 shows that the treatment, despite causing an early
improvement, wears off by 16 weeks at which time no benefit is seen.

A  nomogram  can  be  used  to  obtain  predicted  values,  as  well  as  to  better 
understand  the  model,  just  as  with  a  univariate  Y. 


n <- nomogram(a, age=c(seq(20, 80, by=10), 85))
plot(n, cex.axis=.55, cex.var=.8, lmgp=.25)   # Figure 7.8

Fig. 7.8 Nomogram from GLS fit. Second axis is the baseline score.


7.10  Further  Reading 


Jim  Rochon  (Rho,  Inc.,  Chapel  Hill  NC)  has  the  following  comments  about 
using  the  baseline  measurement  of  Y  as  the  first  longitudinal  response. 


For RCTs [randomized clinical trials], I draw a sharp line at the point
when the intervention begins. The LHS [left hand side of the model
equation] is reserved for something that is a response to treatment. Anything
before this point can potentially be included as a covariate in the
regression model. This includes the “baseline” value of the outcome variable.
Indeed, the best predictor of the outcome at the end of the study is
typically where the patient began at the beginning. It drinks up a lot of
variability in the outcome; and, the effect of other covariates is typically
mediated through this variable.

I treat anything after the intervention begins as an outcome. In the
western scientific method, an “effect” must follow the “cause” even if by a split
second.

Note that an RCT is different than a cohort study. In a cohort study,
“Time 0” is not terribly meaningful. If we want to model, say, the trend
over time, it would be legitimate, in my view, to include the “baseline”
value on the LHS of that regression model.


Now, even if the intervention, e.g., surgery, has an immediate effect, I
would still reserve the LHS for anything that might legitimately
be considered as the response to the intervention. So, if we cleared a
blocked artery and then measured the MABP, then that would still be
included on the LHS.

Now,  it  could  well  be  that  most  of  the  therapeutic  effect  occurred  by 
the  time  that  the  first  repeated  measure  was  taken,  and  then  levels  off. 
Then,  a  plot  of  the  means  would  essentially  be  two  parallel  lines  and  the 
treatment  effect  is  the  distance  between  the  lines,  i.e.,  the  difference  in 
the  intercepts. 

If the linear trend from baseline to Time 1 continues beyond Time 1, then
the lines will have a common intercept but the slopes will diverge. Then,
the treatment effect will be the difference in slopes.

One  point  to  remember  is  that  the  estimated  intercept  is  the  value  at  time 
0  that  we  predict  from  the  set  of  repeated  measures  post  randomization. 

In  the  first  case  above,  the  model  will  predict  different  intercepts  even 
though  randomization  would  suggest  that  they  would  start  from  the  same 
place.  This  is  because  we  were  asleep  at  the  switch  and  didn’t  record  the 
“action”  from  baseline  to  time  1.  In  the  second  case,  the  model  will  predict 
the  same  intercept  values  because  the  linear  trend  from  baseline  to  time 
1  was  continued  thereafter. 

More importantly, there are considerable benefits to including it as a
covariate on the RHS. The baseline value tends to be the best predictor of
the outcome post-randomization, and this maneuver increases the
precision of the estimated treatment effect. Additionally, any other prognostic
factors correlated with the outcome variable will also be correlated with
the baseline value of that outcome, and this has two important
consequences. First, this greatly reduces the need to enter a large number of
prognostic factors as covariates in the linear models. Their effect is already
mediated through the baseline value of the outcome variable. Secondly,
any imbalances across the treatment arms in important prognostic factors
will induce an imbalance across the treatment arms in the baseline value
of the outcome. Including the baseline value thereby reduces the need to
enter these variables as covariates in the linear models.

Stephen Senn [563] states that temporally and logically, a “baseline cannot be
a response to treatment”, so baseline and response cannot be modeled in an
integrated framework.

.  .  .  one  should  focus  clearly  on  ‘outcomes’  as  being  the  only  values  that 
can  be  influenced  by  treatment  and  examine  critically  any  schemes  that 
assume  that  these  are  linked  in  some  rigid  and  deterministic  view  to 
‘baseline’  values.  An  alternative  tradition  sees  a  baseline  as  being  merely 
one  of  a  number  of  measurements  capable  of  improving  predictions  of 
outcomes  and  models  it  in  this  way. 

The final reason that baseline cannot be modeled as the response at time zero is
that many studies have inclusion/exclusion criteria that include cutoffs on the
baseline variable yielding a truncated distribution. In general it is not
appropriate to model the baseline with the same distributional shape as the follow-up
measurements. Thus the approaches recommended by Liang and Zeger [405] and
Liu et al. [423] are problematic^f.

Gardiner et al. [211] compared several longitudinal data models, especially with
regard to assumptions and how regression coefficients are estimated. Peters et al. [500]
have an empirical study confirming that the “use all available data” approach of
likelihood-based longitudinal models makes imputation of follow-up
measurements unnecessary.

Keselman  et  al.34'  did  a  simulation  study  to  study  the  reliability  of  AIC  for 
selecting  the  correct  covariance  structure  in  repeated  measurement  models.  In 
choosing  from  among  11  structures,  AIC  selected  the  correct  structure  47%  of 
the time. Gurka et al. [247] demonstrated that fixed effects in a mixed effects
model can be biased, independent of sample size, when the specified covariance
matrix is more restricted than the true one.


f  In  addition  to  this,  one  of  the  paper’s  conclusions  that  analysis  of  covariance  is  not 
appropriate  if  the  population  means  of  the  baseline  variable  are  not  identical  in  the 
treatment  groups  is  arguable563.  See346  for  a  discussion  of423. 


Chapter  8 

Case  Study  in  Data  Reduction 


Recall that the aim of data reduction is to reduce (without using the outcome)
the number of parameters needed in the outcome model. The following case
study illustrates these techniques:

1. redundancy analysis;

2. variable clustering;

3. data reduction using principal component analysis (PCA), sparse PCA,
and pretransformations;

4. restricted cubic spline fitting using ordinary least squares, in the context
of scaling; and

5. scaling/variable transformations using canonical variates and nonparametric
additive regression.


8.1  Data 

Consider  the  506-patient  prostate  cancer  dataset  from  Byar  and  Green.8'  The 
data  are  listed  in  [28,  Table  46]  and  are  available  in  ASCII  form  from  StatLib 
(lib.stat.cmu.edu)  in  the  Datasets  area  from  this  book’s  Web  page.  These 
data  were  from  a  randomized  trial  comparing  four  treatments  for  stage  3 
and  4  prostate  cancer,  with  almost  equal  numbers  of  patients  on  placebo  and 
each of three doses of estrogen. Four patients had missing values on all of the
following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients
were also missing sz. These patients are excluded from consideration. The
ultimate  goal  of  an  analysis  of  the  dataset  might  be  to  discover  patterns  in 
survival  or  to  do  an  analysis  of  covariance  to  assess  the  effect  of  treatment 
while  adjusting  for  patient  heterogeneity.  See  Chapter  21  for  such  analyses. 
The  data  reductions  developed  here  are  general  and  can  be  used  for  a  variety 
of  dependent  variables. 


The  variable  names,  labels,  and  a  summary  of  the  data  are  printed  below. 


require(Hmisc)
getHdata(prostate)   # Download and make prostate accessible
# Convert an old date format to R format
prostate$sdate <- as.Date(prostate$sdate)
d <- describe(prostate[2:17])
latex(d, file='')


prostate[2:17]

16 Variables      502 Observations


stage : Stage
      n missing  unique    Info    Mean
    502       0       2    0.73   3.424

3 (289, 58%), 4 (213, 42%)


rx
      n missing  unique
    502       0       4

placebo (127, 25%), 0.2 mg estrogen (124, 25%)
1.0 mg estrogen (126, 25%), 5.0 mg estrogen (125, 25%)


dtime : Months of Follow-up
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    502       0      76       1   36.13    1.05    5.00   14.25   34.00   57.75   67.00   71.00

lowest :  0  1  2  3  4, highest: 72 73 74 75 76


status
      n missing  unique
    502       0      10

alive (148, 29%), dead - prostatic ca (130, 26%)
dead - heart or vascular (96, 19%), dead - cerebrovascular (31, 6%)
dead - pulmonary embolus (14, 3%), dead - other ca (25, 5%)
dead - respiratory disease (16, 3%)
dead - other specific non-ca (28, 6%), dead - unspecified non-ca (7, 1%)
dead - unknown cause (7, 1%)


age : Age in Years
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    501       1      41       1   71.46      56      60      70      73      76      78      80

lowest : 48 49 50 51 52, highest: 84 85 87 88 89


wt : Weight Index = wt(kg)-ht(cm)+200
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    500       2      67       1   99.03   77.95   82.90   90.00   98.00  107.00  116.00  123.00

lowest :  69  71  72  73  74, highest: 136 142 145 150 152


pf
      n missing  unique
    502       0       4

normal activity (450, 90%), in bed < 50% daytime (37, 7%)
in bed > 50% daytime (13, 3%), confined to bed (2, 0%)


hx  :  History  of  Cardiovascular  Disease 

n  missing  unique  Info  Sum  Mean 
502  0  2  0.73  213  0.4243 


sbp : Systolic Blood Pressure/10
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    502       0      18    0.98   14.35      11      12      13      14      16      17      18

            8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  30
Frequency   1   3  14  27  65  74  98  74  72  34  17  12   3   2   3   1   1   1
%           0   1   3   5  13  15  20  15  14   7   3   2   1   0   1   0   0   0


dbp : Diastolic Blood Pressure/10
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    502       0      12    0.95   8.149       6       6       7       8       9      10      10

            4   5   6   7   8   9  10  11  12  13  14  18
Frequency   4   5  43 107 165  94  66   9   5   2   1   1
%           1   1   9  21  33  19  13   2   1   0   0   0


ekg
      n missing  unique
    494       8       7

normal (168, 34%), benign (23, 5%)
rhythmic disturb & electrolyte ch (51, 10%)
heart block or conduction def (26, 5%), heart strain (150, 30%)
old MI (75, 15%), recent MI (1, 0%)


hg : Serum Hemoglobin (g/100ml)
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    502       0      91       1   13.45    10.2    10.7    12.3    13.7    14.7    15.8    16.4

lowest :  5.899  7.000  7.199  7.800  8.199
highest: 17.297 17.500 17.598 18.199 21.199


sz : Size of Primary Tumor (cm2)
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    497       5      55       1   14.63     2.0     3.0     5.0    11.0    21.0    32.0    39.2

lowest :  0  1  2  3  4, highest: 54 55 61 62 69


sg : Combined Index of Stage and Hist. Grade
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    491      11      11    0.96   10.31       8       8       9      10      11      13      13

            5   6   7   8   9  10  11  12  13  14  15
Frequency   3   8   7  67 137  33 114  26  75   5  16
%           1   2   1  14  28   7  23   5  15   1   3


ap : Serum Prostatic Acid Phosphatase
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
    502       0     128       1   12.18   0.300   0.300   0.500   0.700   2.975  21.689  38.470

lowest :   0.09999   0.19998   0.29999   0.39996   0.50000
highest: 316.00000 353.50000 367.00000 596.00000 999.87500

bm  :  Bone  Metastases 

n  missing  unique  Info  Sum  Mean 

502  0  2  0.41  82  0.1633 

stage  is  defined  by  ap  as  well  as  X-ray  results.  Of  the  patients  in  stage  3, 
0.92  have  ap  <  0.8.  Of  those  in  stage  4,  0.93  have  ap  >  0.8.  Since  stage  can 
be  predicted  almost  certainly  from  ap,  we  do  not  consider  stage  in  some  of 
the  analyses. 


8.2  How  Many  Parameters  Can  Be  Estimated? 

There  are  354  deaths  among  the  502  patients.  If  predicting  survival  time  were 
of  major  interest,  we  could  develop  a  reliable  model  if  no  more  than  about 
354/15  =  24  parameters  were  examined  against  Y  in  unpenalized  modeling. 
Suppose  that  a  full  model  with  no  interactions  is  fitted  and  that  linearity  is 
not  assumed  for  any  continuous  predictors.  Assuming  age  is  almost  linear, 
we  could  fit  a  restricted  cubic  spline  function  with  three  knots.  For  the  other 
continuous  variables,  let  us  use  five  knots.  For  categorical  predictors,  the 
maximum  number  of  degrees  of  freedom  needed  would  be  one  fewer  than 
the  number  of  categories.  For  pf  we  could  lump  the  last  two  categories  since 
the  last  category  has  only  2  patients.  Likewise,  we  could  combine  the  last 
two  levels  of  ekg.  Table  8.1  lists  the  candidate  predictors  with  the  maximum 
number  of  parameters  we  consider  for  each. 


Table 8.1 Degrees of freedom needed for predictors

Predictor:      rx  age   wt   pf   hx  sbp  dbp  ekg   hg   sz   sg   ap   bm
# Parameters:    3    2    4    2    1    4    4    5    4    4    4    4    1
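
The arithmetic behind this budget can be checked directly. The following is a small sketch in base R that compares the parameter counts of Table 8.1 with the rough limit of one parameter per 15 events used above; the object names df, deaths, budget, and total are ours and do not appear elsewhere in this chapter.

```r
# Parameter counts taken from Table 8.1
df <- c(rx = 3, age = 2, wt = 4, pf = 2, hx = 1, sbp = 4, dbp = 4,
        ekg = 5, hg = 4, sz = 4, sg = 4, ap = 4, bm = 1)

deaths <- 354            # number of events among the 502 patients
budget <- deaths / 15    # rough limit on parameters (about 24)
total  <- sum(df)        # parameters requested by the full additive model

cat("Budget:", round(budget), " Requested:", total, "\n")
```

The requested total exceeds the budget by a wide margin, which is what motivates the data reduction steps that follow.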


8.3  Redundancy  Analysis 

As described in Section 4.7.1, it is occasionally useful to do a rigorous
redundancy analysis on a set of potential predictors. Let us run the algorithm
discussed there, on the set of predictors we are considering. We will use a low
threshold (0.3) for R² for demonstration purposes.


# Allow only 1 d.f. for three of the predictors
prostate <-
  transform(prostate,
            ekg.norm = 1*(ekg %in% c("normal","benign")),
            rxn = as.numeric(rx),
            pfn = as.numeric(pf))
# Force pfn, rxn to be linear because of difficulty of placing
# knots with so many ties in the data
# Note: all incomplete cases are deleted (inefficient)
redun(~ stage + I(rxn) + age + wt + I(pfn) + hx +
      sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
      r2=.3, type='adjusted', data=prostate)

Redundancy Analysis

redun(formula = ~stage + I(rxn) + age + wt + I(pfn) + hx +
    sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
    data = prostate, r2 = 0.3, type = "adjusted")

n: 483   p: 14   nk: 3

Number of NAs: 19

Frequencies of Missing Values Due to Each Variable

   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp
       0        0        1        2        0        0        0        0
ekg.norm       hg       sz       sg       ap       bm
       0        0        5       11        0        0

Transformation of target variables forced to be linear

R2 cutoff: 0.3   Type: adjusted

R2 with which each variable can be predicted from all other variables:

   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp
   0.658    0.000    0.073    0.111    0.156    0.062    0.452    0.417
ekg.norm       hg       sz       sg       ap       bm
   0.055    0.146    0.192    0.540    0.147    0.391

Rendundant variables:

stage sbp bm sg

Predicted from variables:

I(rxn) age wt I(pfn) hx dbp ekg.norm hg sz ap

  Variable Deleted    R2   R2 after later deletions
1 stage            0.658   0.658 0.646 0.494
2 sbp              0.452   0.453 0.455
3 bm               0.374   0.367
4 sg               0.342


By any reasonable criterion on R², none of the predictors is redundant. stage
can be predicted with an R² = 0.658 from the other 13 variables, but only
with R² = 0.493 after deletion of 3 variables later declared to be "redundant."
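
The quantity driving a redundancy analysis can be illustrated in base R by regressing each candidate predictor on all of the others and recording the adjusted R². What follows is only a sketch on simulated data: it uses linear terms rather than the spline expansions used by redun, it does not perform the stepwise deletions, and the variables x1 to x4 are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)   # almost an exact function of x1 and x2
x4 <- rnorm(n)                       # unrelated to the other variables
d  <- data.frame(x1, x2, x3, x4)

# Adjusted R^2 with which each variable is predicted from all the others
adj_r2 <- sapply(names(d), function(v) {
  f <- reformulate(setdiff(names(d), v), response = v)
  summary(lm(f, data = d))$adj.r.squared
})
print(round(adj_r2, 3))
```

In this artificial example x1, x2, and x3 are each highly predictable from the rest, while x4 is not. Once one member of the trio is removed, the others are no longer predictable, which is why R² must be recomputed after each deletion, as in the last columns of the table above.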


8.4  Variable  Clustering 

From Table 8.1, the total number of parameters is 42, so some data reduction
should be considered. We resist the temptation to take the "easy way out" using
stepwise variable selection so that we can achieve a more stable modeling
process and obtain unbiased standard errors. Before using a variable clustering
procedure, note that ap is extremely skewed. To handle skewness, we use
Spearman rank correlations for continuous variables (later we transform each
variable using transcan, which will allow ordinary correlation coefficients to
be used). After classifying ekg as "normal/benign" versus everything else, the
Spearman correlations are plotted below.


x <- with(prostate,
          cbind(stage, rx, age, ...))
# If no missing data, could use cor(apply(x, 2, rank))
r <- rcorr(x, type="spearman")$r    # rcorr in Hmisc
maxabsr <- max(abs(r[row(r) != col(r)]))
p <- nrow(r)
plot(c(-.35, p + .5), c(.5, ...)    # Figure 8.1
v <- dimnames(...)
text(rep(.5, p), ...)
...
}

We  perform  a  hierarchical  cluster  analysis  based  on  a  similarity  matrix 
that  contains  pairwise  Hoeffding  D  statistics.295  D  will  detect  nonmonotonic 
associations. 
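
Two points made in this section can be checked with base R alone. The sketch below uses simulated data with hypothetical column names a, b, and c. It first confirms that, with no missing values, the Spearman matrix equals the ordinary correlation matrix of the column ranks, and then shows a deterministic but nonmonotonic relationship that rank correlation misses entirely. Hoeffding's D itself is not computed here.

```r
set.seed(2)
x <- cbind(a = rexp(50), b = rnorm(50), c = runif(50))
x[, "b"] <- x[, "b"] + log(x[, "a"])     # induce a monotonic association

# Spearman correlation is Pearson correlation applied to ranks
r_spearman <- cor(x, method = "spearman")
r_ranks    <- cor(apply(x, 2, rank))
same <- isTRUE(all.equal(r_spearman, r_ranks))

# A perfect but nonmonotonic (U-shaped) relationship
u   <- -50:50
v   <- u^2
rho <- cor(u, v, method = "spearman")

cat("Matrices agree:", same, "\n")
cat("Spearman rho for v = u^2:", round(rho, 6), "\n")
```

Here v is an exact function of u, yet the rank correlation is zero. This is the kind of association that a similarity measure sensitive to nonmonotonic dependence is meant to pick up.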


vc <- varclus(~ stage + rxn + age + wt + pfn + hx +
              sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
              sim='hoeffding', data=prostate)
plot(vc)   # Figure 8.2

We  combine  sbp  and  dbp,  and  tentatively  combine  ap,  sg,  sz,  and  bm. 


8.5  Transformation  and  Single  Imputation  Using 
transcan 


Now we turn to the scoring of the predictors to potentially reduce the number
of regression parameters that are needed later by doing away with the need for
nonlinear terms and multiple dummy variables. The R Hmisc package transcan
function defaults to using a maximum generalized variance method368 that
incorporates canonical variates to optimally transform both sides of a
multiple regression model. Each predictor is treated in turn as a variable being
predicted, and all variables are expanded into restricted cubic splines (for
continuous variables) or dummy variables (for categorical ones).

Fig. 8.1 Matrix of Spearman ρ rank correlation coefficients between predictors.
Horizontal gray scale lines correspond to ρ = 0. The tallest bar corresponds to
|ρ| = 0.78.

Fig. 8.2 Hierarchical clustering using Hoeffding's D as a similarity measure. Dummy
variables were used for the categorical variable ekg. Some of the dummy variables
cluster together since they are by definition negatively correlated.


# Combine 2 levels of ekg (one had freq. 1)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] <- 'MI'

prostate$pf.coded <- as.integer(prostate$pf)
# make a numeric version; combine last 2 levels of original
levels(prostate$pf) <- levels(prostate$pf)[c(1,2,3,3)]

ptrans <-
  transcan(~ sz + sg + ap + sbp + dbp +
           age + wt + hg + ekg + pf + bm + hx, imputed=TRUE,
           transformed=TRUE, trantab=TRUE, pl=FALSE,
           show.na=TRUE, data=prostate, frac=.1, pr=FALSE)
summary(ptrans, digits=4)


transcan(x = ~sz + sg + ap + sbp + dbp + age + wt + hg + ekg +
    pf + bm + hx, imputed = TRUE, trantab = TRUE, transformed = TRUE,
    pr = FALSE, pl = FALSE, show.na = TRUE, data = prostate,
    frac = 0.1)

Iterations: 8

R2 achieved in predicting each variable:

   sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx
0.207 0.556 0.573 0.498 0.485 0.095 0.122 0.158 0.092 0.113 0.349 0.108

Adjusted R2:

   sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx
0.180 0.541 0.559 0.481 0.468 0.065 0.093 0.129 0.059 0.086 0.331 0.083

Coefficients of canonical variates for predicting each (row) variable

       sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx
sz         0.66  0.20  0.33  0.33 -0.01 -0.01  0.11  0.11  0.03 -0.36  0.34
sg   0.23        0.84  0.08  0.07 -0.02  0.01 -0.01 -0.07  0.02 -0.20  0.14
ap   0.07  0.80       -0.11 -0.05  0.03 -0.02  0.01  0.01  0.00 -0.83 -0.03
sbp  0.13  0.10 -0.14       -0.94  0.14 -0.09  0.03  0.10  0.10 -0.03 -0.14
dbp  0.13  0.09 -0.06 -0.98        0.14  0.07  0.05  0.03  0.04  0.03 -0.01
age -0.02 -0.06  0.18  0.58  0.57        0.14  0.46  0.43 -0.03  1.05 -0.76
wt  -0.02  0.06 -0.08 -0.31  0.23  0.12        0.51 -0.06  0.21 -1.09  0.27
hg   0.13 -0.02  0.03  0.09  0.15  0.33  0.43       -0.02  0.24 -1.53 -0.12
ekg  0.20 -0.38  0.10  0.42  0.12  0.41 -0.04 -0.04        0.15 -0.42 -1.23
pf   0.04  0.08  0.02  0.36  0.14 -0.03  0.22  0.29  0.13       -1.75 -0.46
bm  -0.02 -0.03 -0.13  0.00  0.00  0.03 -0.04 -0.06 -0.01 -0.06       -0.02
hx   0.04  0.05 -0.01 -0.04  0.00 -0.06  0.02 -0.01 -0.09 -0.04 -0.05

Summary of imputed values

sz
      n missing  unique    Info    Mean
      5       0       4    0.95   12.86

6 (2, 40%), 7.416 (1, 20%), 20.18 (1, 20%), 24.69 (1, 20%)

sg
      n missing  unique    Info    Mean     .05     .10     .25
     11       0      10       1    10.1   6.900   7.289   7.697
    .75     .90     .95
 10.560  15.000  15.000

          6.511 7.289 7.394     8 10.25 10.27 10.32 10.39 10.73    15
Frequency     1     1     1     1     1     1     1     1     1     2
%             9     9     9     9     9     9     9     9     9    18

age
      n missing  unique    Info    Mean
      1       0       1       0   71.65

wt
      n missing  unique    Info    Mean
      2       0       2       1   97.77

91.24 (1, 50%), 104.3 (1, 50%)

ekg
      n missing  unique    Info    Mean
      8       0       4     0.9   2.625

1 (3, 38%), 3 (3, 38%), 4 (1, 12%), 5 (1, 12%)

Starting estimates for imputed values:

  sz   sg   ap  sbp  dbp  age   wt   hg  ekg   pf   bm   hx
11.0 10.0  0.7 14.0  8.0 73.0 98.0 13.7  1.0  1.0  0.0  0.0


ggplot(ptrans, scale=TRUE) +
  theme(axis.text.x=element_text(size=6))   # Figure 8.3

The plotted output is shown in Figure 8.3. Note that at face value the
transformation of ap was derived in a circular manner, since the combined index
of stage and histologic grade, sg, uses in its stage component a cutoff on ap.
However, if sg is omitted from consideration, the resulting transformation for
ap does not change appreciably. Note that bm and hx are represented as binary
variables, so their coefficients in the table of canonical variable coefficients
are on a different scale. For the variables that were actually transformed, the
coefficients are for standardized transformed variables (mean 0, variance 1).
From examining the R²s, age, wt, ekg, pf, and hx are not strongly related
to other variables. Imputations for age, wt, ekg are thus relying more on the
median or modal values from the marginal distributions. From the coefficients
of first (standardized) canonical variates, sbp is predicted almost solely from
dbp; bm is predicted mainly from ap, hg, and pf.

Fig. 8.3 Simultaneous transformation and single imputation of all candidate
predictors using transcan. Imputed values are shown as red plus signs. Transformed
values are arbitrarily scaled to [0, 1].


8.6  Data  Reduction  Using  Principal  Components 

The first PC, PC1, is the linear combination of standardized variables having
maximum variance. PC2 is the linear combination of predictors having the
second largest variance such that PC2 is orthogonal to (uncorrelated with)
PC1. If there are p raw variables, the first k PCs, where k < p, will explain
only part of the variation in the whole system of p variables unless one or
more of the original variables is exactly a linear combination of the remaining
variables. Note that it is common to scale and center variables to have mean
zero and variance 1 before computing PCs.
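
These properties are easy to verify numerically. The following sketch uses the USArrests data frame that ships with R (four variables) rather than the prostate data, so the numbers have no bearing on this case study. It checks that the component variances are in descending order, that they sum to p when the correlation matrix is used, and that the component scores are mutually uncorrelated.

```r
# PCs computed on the correlation matrix, i.e., on standardized variables
pc   <- princomp(USArrests, cor = TRUE)
vars <- pc$sdev^2                      # variance of each component
cumv <- cumsum(vars) / sum(vars)       # cumulative fraction of variance

# Correlations among component scores; off-diagonal entries should be ~0
score_cor <- cor(pc$scores)
max_offdiag <- max(abs(score_cor[row(score_cor) != col(score_cor)]))

print(round(vars, 3))
print(round(cumv, 3))
cat("Sum of variances:", sum(vars), " Largest off-diagonal correlation:",
    signif(max_offdiag, 2), "\n")
```

The cumulative fractions in cumv are what the addscree function defined later in this section writes onto the scree plot.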

The response variable (here, time until death due to any cause) is not
examined during data reduction, so that if PCs are selected by variance
explained in the X-space and not by variation explained in Y, one needn't
correct for model uncertainty or multiple comparisons.

PCA results in data reduction when the analyst uses only a subset of the
p possible PCs in predicting Y. This is called incomplete principal component
regression. When one sequentially enters PCs into a predictive model in a
strict pre-specified order (i.e., by descending amounts of variance explained
for the system of p variables), model uncertainty requiring bootstrap
adjustment is minimized. In contrast, model uncertainty associated with stepwise
regression (driven by associations with Y) is massive.
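
As a minimal sketch of incomplete principal component regression, the code below enters PCs in their pre-specified order of variance explained and records the AIC of each fit. It is an illustration under simplifying assumptions: it uses ordinary least squares on the built-in mtcars data instead of a Cox model on the prostate data, and the choice of the five predictors and of mpg as the response is arbitrary.

```r
X   <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]
y   <- mtcars$mpg
pcs <- princomp(X, cor = TRUE)$scores   # columns ordered by variance explained

# Fit models using the first k components, k = 1, ..., p, in that fixed order
aic <- sapply(seq_len(ncol(pcs)), function(k) {
  AIC(lm(y ~ pcs[, 1:k, drop = FALSE]))
})
names(aic) <- paste0("k=", seq_len(ncol(pcs)))
print(round(aic, 2))
```

Because the order of entry is fixed by the predictors alone, the only data-driven choice involving the response is the number of components k.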

For the prostate dataset, consider PCs on raw candidate predictors,
expanding polytomous factors using dummy variables. The R function princomp
is used, after singly imputing missing raw values using transcan's optimal
additive nonlinear models. In this series of analyses we ignore the treatment
variable, rx.

# Impute all missing values in all variables given to transcan
imputed <- impute(ptrans, data=prostate, list.out=TRUE)

imputed <- as.data.frame(imputed)
# Compute principal components on imputed data.
# Create a design matrix from ekg categories
Ekg <- model.matrix(~ ekg, data=imputed)[, -1]
# Use correlation matrix
pfn <- prostate$pfn
prin.raw <- princomp(~ sz + sg + ap + sbp + dbp + age +
                     wt + hg + Ekg + pfn + bm + hx,
                     cor=TRUE, data=imputed)

plot(prin.raw, type='lines', main='', ylim=c(0,3))   # Figure 8.4
# Add cumulative fraction of variance explained
addscree <- function(x, npcs=min(10, length(x$sdev)),
                     plotv=FALSE,
                     col=1, offset=.8, adj=0, pr=FALSE) {
  vars <- x$sdev^2
  cumv <- cumsum(vars)/sum(vars)
  if(pr) print(cumv)
  text(1:npcs, vars[1:npcs] + offset*par('cxy')[2],
       as.character(round(cumv[1:npcs], 2)),
       srt=45, adj=adj, cex=.65, xpd=NA, col=col)
  if(plotv) lines(1:npcs, vars[1:npcs], type='b', col=col)
}
addscree(prin.raw)
prin.trans <- princomp(ptrans$transformed, cor=TRUE)
addscree(prin.trans, npcs=10, plotv=TRUE, col='red',
         offset=-.8, adj=1)


Fig. 8.4 Variance of the system of raw predictors (black) explained by individual
principal components (lines) along with cumulative proportion of variance explained
(text), and variance explained by components computed on transcan-transformed
variables (red)


The  resulting  plot  shown  in  Figure  8.4  is  called  a  “scree”  plot  [325,  pp.  96-99, 
104,  106].  It  shows  the  variation  explained  by  the  first  k  principal  components 
as  k  increases  all  the  way  to  16  parameters  (no  data  reduction).  It  requires 
10  of  the  16  possible  components  to  explain  >  0.8  of  the  variance,  and  the 
first  5  components  explain  0.49  of  the  variance  of  the  system.  Two  of  the  16 
dimensions  are  almost  totally  redundant. 

After repeating this process when transforming all predictors via transcan,
we have only 12 degrees of freedom for the 12 predictors. The variance
explained is depicted in Figure 8.4 in red. It requires at least 9 of the 12
possible components to explain > 0.9 of the variance, and the first 5 components
explain 0.66 of the variance as opposed to 0.49 for untransformed variables.

Let us see how the PCs "explain" the times until death using the Cox
regression function from rms, cph, described in Chapter 20. In what follows
we vary the number of components used in the Cox models from 1 to all 16,
computing the AIC for each model. AIC is related to model log likelihood
penalized for number of parameters estimated, and lower is better. For
reference, the AIC of the model using all of the original predictors, and the AIC
of a full additive spline model are shown as horizontal lines.

For the money, the first 5 components adequately summarize all variables,
if linearly transformed, and the full linear model is no better than this. The
model allowing all continuous predictors to be nonlinear is not worth its
added degrees of freedom.
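
The penalization used in these comparisons is AIC = -2 log L + 2p, where p is the number of estimated parameters. As a hedged illustration only, the sketch below verifies this identity for an ordinary linear model fitted to the built-in mtcars data; the Cox models of this section use a partial likelihood, but the penalty has the same form.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- logLik(fit)

# attr(ll, "df") counts the estimated parameters, including the residual variance
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
cat("Manual AIC:", aic_manual, " AIC():", AIC(fit), "\n")
```

An extra parameter is worthwhile by this criterion only if it raises the log likelihood by more than 1, which is why the full spline model can lose to a model with far fewer degrees of freedom.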

Next check the performance of a model derived from cluster scores of
transformed variables.

Fig. 8.5 AIC of Cox models fitted with progressively more principal components.
The solid blue line depicts the AIC of the model with all original covariates. The
dotted blue line is positioned at the AIC of the full spline model.


[1]  3954.393 


print(f, latex=TRUE, long=FALSE, title='')


                 Model Tests            Discrimination Indexes
Obs      502     LR χ2        81.11     R2     0.149
Events   354     d.f.             7     Dxy    0.286
Center     0     Pr(> χ2)    0.0000     g      0.562
                 Score χ2     86.81     gr     1.755
                 Pr(> χ2)    0.0000

            Coef     S.E.   Wald Z   Pr(>|Z|)
tumor    -0.1723   0.0367    -4.69   < 0.0001
bp       -0.0251   0.0424    -0.59     0.5528
cardiac  -0.2513   0.0516    -4.87   < 0.0001
hg       -0.1407   0.0554    -2.54     0.0111
age      -0.1034   0.0579    -1.79     0.0739
pf       -0.0933   0.0487    -1.92     0.0551
wt       -0.0910   0.0555    -1.64     0.1012

The tumor and cardiac clusters seem to dominate prediction of mortality,
and the AIC of the model built from cluster scores of transformed variables
compares favorably with other models (Figure 8.5).


8.6.1 Sparse Principal Components


A disadvantage of principal components is that every predictor receives a
nonzero weight for every component, so many coefficients are involved even
though the effective degrees of freedom with respect to the response model
are reduced. Sparse principal components672 uses a penalty function to reduce
the magnitude of the loadings variables receive in the components. If an L1
penalty is used (as with the lasso), some loadings are shrunk to zero,
resulting in some simplicity. Sparse principal components combines some elements
of variable clustering, scoring of variables within clusters, and redundancy
analysis.
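
To see why an L1 penalty produces exact zeros, consider the soft-thresholding operation that commonly arises in L1-penalized estimation. The sketch below is an illustration of that general idea only; it is not the algorithm used by any particular sparse PCA package, and the loading vector w, the penalty lambda, and the function name soft_threshold are hypothetical.

```r
# Shrink each weight toward zero by lambda; weights smaller than lambda become 0
soft_threshold <- function(w, lambda) sign(w) * pmax(abs(w) - lambda, 0)

# Hypothetical dense loadings for one component
w <- c(sz = 0.30, sg = 0.60, ap = 0.62, bm = -0.40, age = 0.05, wt = -0.03)

w_sparse <- soft_threshold(w, lambda = 0.1)
w_sparse <- w_sparse / sqrt(sum(w_sparse^2))   # rescale to unit length

print(round(w_sparse, 3))
```

The two small weights are set exactly to zero while the large ones are merely shrunk, so the component ends up defined by a subset of the variables.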

Filzmoser, Fritz, and Kalcher191 have written a nice R package pcaPP for
doing sparse PC analysis.a The following example uses the prostate data
again. To allow for nonlinear transformations and to score the ekg variable
in the prostate dataset down to a scalar, we use the transcan-transformed
predictors as inputs.


require(pcaPP)
s <- sPCAgrid(ptrans$transformed, k=10, method='sd',
              center=mean, scale=sd, scores=TRUE,
              maxiter=10)
plot(s, type='lines', main='', ylim=c(0,3))   # Figure 8.6
addscree(s)
s$loadings   # These loadings are on the orig. transcan scale

Loadings  : 

Comp .  1  Comp 

.  2  Comp  .  3  Comp  .  4  Comp  .  5  Comp  .  6  Comp  .  7  Comp 

.  8  Comp  .  9  Comp  .  1  0 

sz  0.248 

0.950 

sg  0.620 

0.522 

ap  0.634 

-0.305 

sbp  — 0.707 

dbp  0.707 

age 

1.000 

wt 

1.000 

hg 

1.000 

ekg 

1.000 

Pf 

1.000 

bm  —0.391 

0.852 

hx 

1.000 

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083
Cumulative Var  0.083  0.167  0.250  0.333  0.417  0.500  0.583  0.667
               Comp.9 Comp.10
SS loadings     1.000   1.000
Proportion Var  0.083   0.083
Cumulative Var  0.750   0.833

Only  nonzero  loadings  are  shown.  The  first  sparse  PC  is  the  tumor  cluster 
used  above,  and  the  second  is  the  blood  pressure  cluster.  Let  us  see  how  well 
incomplete  sparse  principal  component  regression  predicts  time  until  death. 


a The spca package is a new sparse PC package that should also be considered.

Fig. 8.6 Variance explained by individual sparse principal components (lines) along
with cumulative proportion of variance explained (text)

More components are required to optimize AIC than were seen in Figure 8.5,
but a model built from 6-8 sparse PCs performed as well as the other models.


8.7  Transformation  Using  Nonparametric  Smoothers 

The ACE nonparametric additive regression method of Breiman and Friedman 3
transforms both the left-hand-side variable and all the right-hand-side
variables so as to optimize R². ACE can be used to transform the predictors
using the R ace function in the acepack package, called by the transace
function in the Hmisc package. transace does not impute data but merely
does casewise deletion of missing values. Here transace is run after single
imputation by transcan. binary is used to tell transace which variables not to
try to predict (because they need no transformation). Several predictors are
restricted to be monotonically transformed.

Fig. 8.7 Performance of sparse principal components in Cox models

categorical="ekg", binary=c("bm","hx"))


R2 achieved in predicting each variable:

       sz        sg        ap       sbp       dbp       age        wt
0.2265824 0.5762743 0.5717747 0.4823852 0.4580924 0.1514527 0.1732244
       hg       ekg        pf        bm        hx
0.2001008 0.1110709 0.1778705        NA        NA


Except  for  ekg,  age,  and  for  arbitrary  sign  reversals,  the  transformations  in 
Figure  8.8  determined  using  transace  were  similar  to  those  in  Figure  8.3.  The 
transcan  transformation  for  ekg  makes  more  sense. 
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Sauerbrei  and  Schumacher541  used  the  bootstrap  to  demonstrate  the  variability 
of  a  standard  variable  selection  procedure  for  the  prostate  cancer  dataset. 
Schemper  and  Heinze551  used  logistic  models  to  impute  dichotomizations  of  the 
predictors  for  this  dataset. 
Fig. 8.8 Simultaneous transformation of all variables using ACE.


8.9  Problems 

The  Mayo  Clinic  conducted  a  randomized  trial  in  primary  biliary  cirrhosis 
(PBC) of the liver between January 1974 and May 1984, to compare
D-penicillamine with placebo. The drug was found to be ineffective [197, p.
2],  and  the  trial  was  done  before  liver  transplantation  was  common,  so  this 
trial  constitutes  a  natural  history  study  for  PBC.  Followup  continued  through 
July, 1986. For the 19 patients who did undergo transplant, followup time was
censored (status=0) at the day of transplant. 312 patients were randomized,
and  another  106  patients  were  entered  into  a  registry.  The  nonrandomized 
patients  have  most  of  their  laboratory  values  missing,  except  for  bilirubin, 
albumin,  and  prothrombin  time.  28  randomized  patients  had  both  serum 
cholesterol and triglycerides missing. The data, which consist of clinical,
biochemical, serologic, and histologic information, are listed in [197, pp. 359-
375].  The  PBC  data  are  discussed  and  analyzed  in  [197,  pp.  2-7,  102-104, 
153-162],  [158],  [7]  (a  tree-based  analysis  which  on  its  p.  480  mentions  some 
possible  lack  of  fit  of  the  earlier  analyses),  and  [361].  The  data  are  stored  in 
the  datasets  web  site  so  may  be  accessed  using  the  Hmisc  getHdata  function 
with  argument  pbc.  Use  only  the  data  on  randomized  patients  for  all  analyses. 
For  Problems  1-6,  ignore  followup  time,  status,  and  drug. 
1. Do an initial variable clustering based on ranks, using pairwise deletion of
missing  data.  Comment  on  the  potential  for  one-dimensional  summaries  of 
subsets  of  variables  being  adequate  summaries  of  prognostic  information. 

2. cholesterol, triglycerides, platelets, and copper are missing on some
patients. Impute them using a method you recommend. Use some or all of
the  remaining  predictors  and  possibly  the  outcome.  Provide  a  correlation 
coefficient  describing  the  usefulness  of  each  imputation  model.  Provide 
the  actual  imputed  values,  specifying  observation  numbers.  For  all  later 
analyses,  use  imputed  values  for  missing  values. 

3. Perform a scaling/transformation analysis to better measure how the
predictors interrelate and to possibly pretransform some of them. Use transcan
or ACE. Repeat the variable clustering using the transformed scores and
Pearson correlation or using an oblique rotation principal component
analysis. Determine if the correlation structure (or variance explained by the
first  principal  component)  indicates  whether  it  is  possible  to  summarize 
multiple  variables  into  single  scores. 

4. Do a principal component analysis of all transformed variables
simultaneously. Make a graph of the number of components versus the
cumulative proportion of explained variation. Repeat this for laboratory variables
alone. 

5. Repeat the overall PCA using sparse principal components. Pay attention
to how best to solve for sparse components, e.g., consider the lambda
parameter  in  sPCAgrid. 

6.  How  well  can  variables  (lab  and  otherwise)  that  are  routinely  collected 
(on  nonrandomized  patients)  capture  the  information  (variation)  of  the 
variables  that  are  often  missing?  It  would  be  helpful  to  explore  the  strength 
of  interrelationships  by 

a. correlating two $PC_1$s obtained from untransformed variables,

b. correlating two $PC_1$s obtained from transformed variables,

c.  correlating  the  best  linear  combination  of  one  set  of  variables  with  the 
best  linear  combination  of  the  other  set,  and 

d.  doing  the  same  on  transformed  variables. 

For this problem consider only complete cases, and transform the 5
nonnumeric categorical predictors to binary 0-1 variables.

7.  Consider  the  patients  having  complete  data  who  were  randomized  to 
placebo.  Consider  only  models  that  are  linear  in  all  the  covariates. 

a. Fit a survival model to predict time of death using the following
covariates: bili, albumin, stage, protime, age, alk.phos, sgot, chol, trig,
platelet,  copper. 

b.  Perform  an  ordinary  principal  component  analysis.  Fit  the  survival 
model using only the first 3 PCs. Compare the likelihood ratio $\chi^2$ and
AIC  with  that  of  the  model  using  the  original  variables. 
c. Considering the PCs are fixed, use the bootstrap to estimate the 0.95
confidence interval of the inter-quartile-range age effect on the original
scale, and the same type of confidence interval for the coefficient of $PC_1$.

d.  Now  accounting  for  uncertainty  in  the  PCs,  compute  the  same  two 
confidence  intervals.  Compare  and  interpret  the  two  sets.  Take  into 
account  the  fact  that  PCs  are  not  unique  to  within  a  sign  change. 

R  programming  hints  for  this  exercise  are  found  on  the  course  web  site. 


Chapter  9 

Overview  of  Maximum  Likelihood 
Estimation 


9.1  General  Notions — Simple  Cases 

In  ordinary  least  squares  multiple  regression,  the  objective  in  fitting  a  model 
is  to  find  the  values  of  the  unknown  parameters  that  minimize  the  sum  of 
squared errors of prediction. When the response variable is non-normal,
polytomous, or not observed completely, one needs a more general objective
function to optimize.

Maximum likelihood (ML) estimation is a general technique for estimating
parameters and drawing statistical inferences in a variety of situations,
especially  nonstandard  ones.  Before  laying  out  the  method  in  general,  ML 
estimation  is  illustrated  with  a  standard  situation,  the  one-sample  binomial 
problem.  Here,  independent  binary  responses  are  observed  and  one  wishes  to 
draw  inferences  about  an  unknown  parameter,  the  probability  of  an  event  in 
a  population. 

Suppose that in a population of individuals, each individual has the same
probability $P$ that an event occurs. We could also say that the event has
already been observed, so that $P$ is the prevalence of some condition in the
population. For each individual, let $Y = 1$ denote the occurrence of the
event and $Y = 0$ denote nonoccurrence. Then $\text{Prob}\{Y = 1\} = P$ for each
individual. Suppose that a random sample of size 3 from the population is
drawn and that the first individual had $Y = 1$, the second had $Y = 0$, and the
third had $Y = 1$. The respective probabilities of these outcomes are $P$, $1 - P$,
and $P$. The joint probability of observing the independent events $Y = 1, 0, 1$
is $P(1 - P)P = P^2(1 - P)$. Now the value of $P$ is unknown, but we can solve
for the value of $P$ that makes the observed data ($Y = 1, 0, 1$) most likely
to have occurred. In this case, the value of $P$ that maximizes $P^2(1 - P)$ is
$P = 2/3$. This value for $P$ is the maximum likelihood estimate (MLE) of the
population probability.
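
The three-observation example can be checked numerically. The following is a minimal sketch in base R, not part of the original text; the names lik and p_hat are illustrative, and optimize() is used only as a convenient one-dimensional maximizer.

```r
# Likelihood of observing Y = 1, 0, 1 as a function of P
lik <- function(P) P^2 * (1 - P)

# One-dimensional numerical maximization over the interval (0, 1)
opt   <- optimize(lik, interval = c(0, 1), maximum = TRUE)
p_hat <- opt$maximum

print(p_hat)   # close to 2/3, up to the tolerance of optimize()
```

The numerical maximizer agrees with the analytic answer only to the optimizer's default tolerance, which is why the closed form is preferred when one exists.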
Let us now study the situation of independent binary trials in general. Let
the sample size be $n$ and the observed responses be $Y_1, Y_2, \ldots, Y_n$. The joint
probability of observing the data is given by

$$L = \prod_{i=1}^{n} P^{Y_i}(1 - P)^{1 - Y_i}. \qquad (9.1)$$

Now let $s$ denote the sum of the $Y$s or the number of times that the event
occurred ($Y_i = 1$), that is, the number of “successes.” The number of
nonoccurrences (“failures”) is $n - s$. The likelihood of the data can be simplified
to

$$L = P^s(1 - P)^{n - s}. \qquad (9.2)$$

It is easier to work with the log likelihood function, which also has desirable
statistical properties. For the one-sample binary response problem, the log
likelihood is

$$\log L = s \log(P) + (n - s) \log(1 - P). \qquad (9.3)$$

The MLE of $P$ is that value of $P$ that maximizes $L$ or $\log L$. Since $\log L$
is a smooth function of $P$, its maximum value can be found by finding the
point at which $\log L$ has a slope of 0. The slope or first derivative of $\log L$,
with respect to $P$, is

$$U(P) = \partial \log L/\partial P = s/P - (n - s)/(1 - P). \qquad (9.4)$$

The first derivative of the log likelihood function with respect to the
parameter(s), here $U(P)$, is called the score function. Equating this function to zero
requires that $s/P = (n - s)/(1 - P)$. Multiplying both sides of the equation
by $P(1 - P)$ yields $s(1 - P) = (n - s)P$ or that $s = (n - s)P + sP = nP$.
Thus the MLE of $P$ is $p = s/n$.
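
As a check on the algebra, the score equation can also be solved numerically. This is a small base R sketch under assumed values $n = 20$ and $s = 12$ (chosen for illustration); score and root are hypothetical names.

```r
n <- 20; s <- 12   # illustrative sample: 12 successes in 20 trials

# Score function U(P) from Eq. 9.4
score <- function(P) s / P - (n - s) / (1 - P)

# U(P) is positive below the MLE and negative above it, so a root finder
# on an interval strictly inside (0, 1) locates the MLE
root <- uniroot(score, interval = c(0.01, 0.99), tol = 1e-10)$root

print(c(numerical = root, closed_form = s / n))
```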

Another important function is called the Fisher information about the
unknown parameters. The information function is the expected value of the
negative of the curvature in $\log L$, which is the negative of the slope of the
slope as a function of the parameter, or the negative of the second derivative
of $\log L$. Motivation for consideration of the Fisher information is as follows.
If the log likelihood function has a distinct peak, the sample provides
information that allows one to readily discriminate between a good parameter
estimate (the location of the obvious peak) and a bad one. In such a case the
MLE will have good precision or small variance. If on the other hand the
likelihood function is relatively flat, almost any estimate will do and the chosen
estimate will have poor precision or large variance. The degree of peakedness
of a function at a given point is the speed with which the slope is changing at
that point, that is, the slope of the slope or second derivative of the function
at that point.




Here, the information is

$$\begin{aligned}
I(P) &= E\{-\partial^2 \log L/\partial P^2\} \\
     &= E\{s/P^2 + (n - s)/(1 - P)^2\} \qquad (9.5) \\
     &= nP/P^2 + n(1 - P)/(1 - P)^2 = n/[P(1 - P)].
\end{aligned}$$

We estimate the information by substituting the MLE of $P$ into $I(P)$, yielding
$I(p) = n/[p(1 - p)]$.

Figures 9.1, 9.2, and 9.3 depict, respectively, $\log L$, $U(P)$, and $I(P)$, all
as a function of $P$. Three combinations of $n$ and $s$ were used in each graph.
These combinations correspond to $p = .5$, $.6$, and $.6$, respectively.


Fig.  9.1  log  likelihood  functions  for  three  one-sample  binomial  problems 


In  each  case  it  can  be  seen  that  the  value  of  P  that  makes  the  data  most 
likely  to  have  occurred  (the  value  that  maximizes  L  or  logL)  is  p  given 
above.  Also,  the  score  function  (slope  of  log  L)  is  zero  at  P  =  p.  Note  that 
the  information  function  I(P)  is  highest  for  P  approaching  0  or  1  and  is 
lowest  for  P  near  .5,  where  there  is  maximum  uncertainty  about  P.  Note 
also  that  while  logL  has  the  same  shape  for  the  s  =  60  and  s  =  12  curves 
in  Figure  9.1,  the  range  of  logL  is  much  greater  for  the  larger  sample  size. 
Figures  9.2  and  9.3  show  that  the  larger  sample  size  produces  a  sharper 
likelihood.  In  other  words,  with  larger  n,  one  can  zero  in  on  the  true  value  of 
P  with  more  precision. 
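
The quantities plotted in the three figures can be tabulated directly. The sketch below is in base R and assumes the three samples are $(n, s) = (100, 50)$, $(100, 60)$, and $(20, 12)$, which matches the values of $p$ quoted above and the examples used later in Section 9.2; the function names are illustrative.

```r
# (n, s) pairs; assumed to be the three samples shown in the figures
cases <- data.frame(n = c(100, 100, 20), s = c(50, 60, 12))

logL <- function(P, n, s) s * log(P) + (n - s) * log(1 - P)   # Eq. 9.3
U    <- function(P, n, s) s / P - (n - s) / (1 - P)           # Eq. 9.4
info <- function(P, n)    n / (P * (1 - P))                   # Eq. 9.5

P <- seq(0.05, 0.95, by = 0.01)   # grid of candidate values of P
for (i in seq_len(nrow(cases))) {
  n <- cases$n[i]; s <- cases$s[i]
  p_hat <- P[which.max(logL(P, n, s))]   # grid maximizer of log L
  cat(sprintf("n=%3d s=%2d  grid MLE=%.2f  U(MLE)=%.3g  I(MLE)=%.1f\n",
              n, s, p_hat, U(p_hat, n, s), info(p_hat, n)))
}
```

For a fixed value of $P$, the information scales linearly with $n$, which is the numerical counterpart of the sharper curves for the larger sample.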
Fig. 9.2 Score functions ($\partial \log L/\partial P$)


Fig. 9.3 Information functions ($-\partial^2 \log L/\partial P^2$)


In  this  binary  response  one-sample  example  let  us  now  turn  to  inference 
about  the  parameter  P.  First,  we  turn  to  the  estimation  of  the  variance  of  the 
MLE,  p.  An  estimate  of  this  variance  is  given  by  the  inverse  of  the  information 
at  P  =  p: 
$$\text{Var}(p) = I(p)^{-1} = p(1 - p)/n. \qquad (9.6)$$

Note  that  the  variance  is  smallest  when  the  information  is  greatest  (p  =  0 
or  1). 

The variance estimate forms a basis for confidence limits on the unknown
parameter. For large $n$, the MLE $p$ is approximately normally distributed
with expected value (mean) $P$ and variance $P(1 - P)/n$. Since $p(1 - p)/n$ is a
consistent estimate of $P(1 - P)/n$, it follows that $p \pm z[p(1 - p)/n]^{1/2}$ is an
approximate $1 - \alpha$ confidence interval for $P$ if $z$ is the $1 - \alpha/2$ critical value
of the standard normal distribution.
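
A minimal base R sketch of this interval follows. The values $n = 100$, $s = 60$, and $\alpha = 0.05$ are assumptions chosen for illustration, and the variable names are hypothetical.

```r
n <- 100; s <- 60      # illustrative sample
alpha <- 0.05          # for a 0.95 confidence interval

p  <- s / n                       # MLE of P
se <- sqrt(p * (1 - p) / n)       # square root of Var(p) from Eq. 9.6
z  <- qnorm(1 - alpha / 2)        # 1 - alpha/2 critical value

ci <- c(lower = p - z * se, upper = p + z * se)
print(round(ci, 3))
```

This is the large-sample interval described in the text; it can perform poorly when $p$ is near 0 or 1 or when $n$ is small.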


9.2  Hypothesis  Tests 

Now let us turn to hypothesis tests about the unknown population parameter
$P$, specifically $H_0: P = P_0$. There are three kinds of statistical tests that arise from
likelihood theory.


9.2.1  Likelihood  Ratio  Test 

This test statistic is the ratio of the likelihood at the hypothesized parameter
values to the likelihood of the data at the maximum (i.e., at parameter values
= MLEs). It turns out that $-2 \times$ the log of this likelihood ratio has desirable
statistical properties. The likelihood ratio test statistic is given by

$$\begin{aligned}
\text{LR} &= -2 \log(L \text{ at } H_0 / L \text{ at MLEs}) \\
          &= -2(\log L \text{ at } H_0) - [-2(\log L \text{ at MLEs})]. \qquad (9.7)
\end{aligned}$$

The LR statistic, for large enough samples, has approximately a $\chi^2$
distribution with degrees of freedom equal to the number of parameters estimated, if
the null hypothesis is “simple,” that is, doesn’t involve any unknown
parameters. Here LR has 1 d.f.

The value of $\log L$ at $H_0$ is

$$\log L(H_0) = s \log(P_0) + (n - s) \log(1 - P_0). \qquad (9.8)$$

The maximum value of $\log L$ (at MLEs) is

$$\log L(P = p) = s \log(p) + (n - s) \log(1 - p). \qquad (9.9)$$
For the hypothesis $H_0: P = P_0$, the test statistic is

$$\text{LR} = -2\{s \log(P_0/p) + (n - s) \log[(1 - P_0)/(1 - p)]\}. \qquad (9.10)$$

Note that when $p$ happens to equal $P_0$, LR $= 0$. When $p$ is far from $P_0$, LR will
be large. Suppose that $P_0 = 1/2$, so that $H_0$ is $P = 1/2$. For $n = 100$, $s = 50$,
LR $= 0$. For $n = 100$, $s = 60$,

$$\text{LR} = -2\{60 \log(.5/.6) + 40 \log(.5/.4)\} = 4.03. \qquad (9.11)$$

For $n = 20$, $s = 12$,

$$\text{LR} = -2\{12 \log(.5/.6) + 8 \log(.5/.4)\} = .81 = 4.03/5. \qquad (9.12)$$

Therefore, even though the best estimate of $P$ is the same for these two cases,
the test statistic is more impressive when the sample size is five times larger.
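
The three LR values above can be reproduced with a few lines of base R. This is an illustrative sketch of Eq. 9.10, not code from the text; lr_stat is a hypothetical helper and is valid only for $0 < s < n$.

```r
# LR statistic of Eq. 9.10 for the one-sample binomial problem (0 < s < n)
lr_stat <- function(n, s, P0) {
  p <- s / n
  -2 * (s * log(P0 / p) + (n - s) * log((1 - P0) / (1 - p)))
}

lr   <- c(lr_stat(100, 50, 0.5), lr_stat(100, 60, 0.5), lr_stat(20, 12, 0.5))
pval <- pchisq(lr, df = 1, lower.tail = FALSE)   # chi-square with 1 d.f.

print(round(cbind(LR = lr, p = pval), 3))
```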


9.2.2  Wald  Test 


The Wald test statistic is a generalization of a t- or z-statistic. It is a function
of the difference in the MLE and its hypothesized value, normalized by an
estimate of the standard deviation of the MLE. Here the statistic is

$$W = [p - P_0]^2/[p(1 - p)/n]. \qquad (9.13)$$

For large enough $n$, $W$ is distributed as $\chi^2$ with 1 d.f. For $n = 100$, $s = 50$,
$W = 0$. For the other samples, $W$ is, respectively, 4.17 and 0.83 (note $0.83 =
4.17/5$).
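
A short base R sketch of Eq. 9.13 for the same three samples follows; wald_stat is a hypothetical helper name, and the variance in the denominator is evaluated at the MLE.

```r
# Wald statistic of Eq. 9.13; the variance estimate uses the MLE p
wald_stat <- function(n, s, P0) {
  p <- s / n
  (p - P0)^2 / (p * (1 - p) / n)
}

w <- c(wald_stat(100, 50, 0.5), wald_stat(100, 60, 0.5), wald_stat(20, 12, 0.5))

print(round(w, 2))
print(round(pchisq(w, df = 1, lower.tail = FALSE), 4))   # 1 d.f. P-values
```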

Many statistical packages treat $\sqrt{W}$ as having a $t$ distribution instead of
a normal distribution. As pointed out by Gould,228 there is no basis for this
outside of ordinary linear models^a.

a In linear regression, a $t$ distribution is used to penalize for the fact that the variance
of $Y|X$ is estimated. In models such as the logistic model, there is no separate
variance parameter to estimate. Gould has done simulations that show that the normal
distribution provides more accurate P-values than the $t$ for binary logistic regression.


9.2.3 Score Test

If the MLE happens to equal the hypothesized value $P_0$, $P_0$ maximizes the
likelihood and so $U(P_0) = 0$. Rao’s score statistic measures how far from zero
the score function is when evaluated at the null hypothesis. The score function
(slope or first derivative of $\log L$) is normalized by the information (curvature
or second derivative of $-\log L$). The test statistic for our example is

$$S = U(P_0)^2/I(P_0), \qquad (9.14)$$

which  formally  does  not  involve  the  MLE,  p.  The  statistic  can  be  simplified 
as  follows. 


$$\begin{aligned}
U(P_0) &= s/P_0 - (n - s)/(1 - P_0) \\
I(P_0) &= E\{s/P_0^2 + (n - s)/(1 - P_0)^2\} = n/[P_0(1 - P_0)] \\
S &= (s - nP_0)^2/[nP_0(1 - P_0)] \\
  &= n(p - P_0)^2/[P_0(1 - P_0)]. \qquad (9.15)
\end{aligned}$$


Note that the numerator of $S$ involves $s - nP_0$, the difference between the
observed number of successes and the number of successes expected under $H_0$.

As with the other two test statistics, $S = 0$ for the first sample. For the
last two samples $S$ is, respectively, 4 and $.8 = 4/5$.
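
The score statistic for the same three samples can be computed without ever forming the MLE. The base R sketch below is illustrative; score_stat is a hypothetical helper that evaluates the score (Eq. 9.4) and the information (Eq. 9.5) at $P_0$.

```r
# Score statistic of Eq. 9.14: everything is evaluated at P0, not at the MLE
score_stat <- function(n, s, P0) {
  U <- s / P0 - (n - s) / (1 - P0)   # score function at P0
  I <- n / (P0 * (1 - P0))           # information at P0
  U^2 / I
}

S <- c(score_stat(100, 50, 0.5), score_stat(100, 60, 0.5), score_stat(20, 12, 0.5))
print(S)
print(round(pchisq(S, df = 1, lower.tail = FALSE), 4))   # 1 d.f. P-values
```

For these samples the score statistic is slightly smaller than the LR statistic, which in turn is slightly smaller than the Wald statistic; that ordering is specific to this example and should not be read as a general rule.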


9.2.4 Normal Distribution — One Sample

Suppose that a sample of size $n$ is taken from a population for a random
variable $Y$ that is known to be normally distributed with unknown mean
$\mu$ and variance $\sigma^2$. Denote the observed values of the random variable by
$Y_1, Y_2, \ldots, Y_n$. Now unlike the binary response case ($Y = 0$ or 1), we cannot
use the notion of the probability that $Y$ equals an observed value. This is
because $Y$ is continuous and the probability that it will take on a given value
is zero. We substitute the density function for the probability. The density
at a point $y$ is the limit as $d$ approaches zero of

$$\text{Prob}\{y < Y \le y + d\}/d = [F(y + d) - F(y)]/d, \qquad (9.16)$$

where $F(y)$ is the normal cumulative distribution function (for a mean of $\mu$
and variance of $\sigma^2$). The limit of the right-hand side of the above equation as
$d$ approaches zero is $f(y)$, the density function of a normal distribution with
mean $\mu$ and variance $\sigma^2$. This density function is

$$f(y) = (2\pi\sigma^2)^{-1/2} \exp\{-(y - \mu)^2/2\sigma^2\}. \qquad (9.17)$$

The likelihood of observing the observed sample values is the joint density
of the $Y$s. The log likelihood function here is a function of two unknowns, $\mu$
and $\sigma^2$.

$$\log L = -.5 n \log(2\pi\sigma^2) - .5 \sum_{i=1}^{n} (Y_i - \mu)^2/\sigma^2. \qquad (9.18)$$
It can be shown that the value of $\mu$ that maximizes $\log L$ is the value that
minimizes the sum of squared deviations about $\mu$, which is the sample mean $\bar{Y}$.
The MLE of $\sigma^2$ is

$$s^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2/n. \qquad (9.19)$$

Recall that the sample variance uses $n - 1$ instead of $n$ in the denominator. It
can be shown that the expected value of the MLE of $\sigma^2$, $s^2$, is $[(n - 1)/n]\sigma^2$;
in other words, $s^2$ is too small by a factor of $(n - 1)/n$ on the average. The
sample variance is unbiased, but being unbiased does not necessarily make
it a better estimator. The MLE has greater precision (smaller mean squared
error) in many cases.
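
The closed-form MLEs can be compared with a direct numerical maximization of Eq. 9.18. The following base R sketch uses simulated data, so the seed, sample size, and true parameter values are assumptions made purely for illustration; optim() with the BFGS method stands in for whatever optimizer one prefers.

```r
set.seed(1)                          # arbitrary seed, for reproducibility
y <- rnorm(50, mean = 10, sd = 2)    # simulated sample (illustrative only)
n <- length(y)

# Negative log likelihood from Eq. 9.18; theta = (mu, log sigma^2)
negloglik <- function(theta) {
  mu     <- theta[1]
  sigma2 <- exp(theta[2])            # log scale keeps sigma^2 positive
  0.5 * n * log(2 * pi * sigma2) + 0.5 * sum((y - mu)^2) / sigma2
}

fit <- optim(c(median(y), 0), negloglik, method = "BFGS",
             control = list(reltol = 1e-12))

mle         <- c(mu = fit$par[1], sigma2 = exp(fit$par[2]))
closed_form <- c(mu = mean(y), sigma2 = sum((y - mean(y))^2) / n)  # Eq. 9.19

print(rbind(numerical = mle, closed_form = closed_form))
print(c(mle_var = closed_form[["sigma2"]], sample_var = var(y)))   # n vs n - 1
```

Optimizing over $\log \sigma^2$ rather than $\sigma^2$ is a design choice that avoids negative variance proposals; the MLE is unchanged by this reparameterization.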


9.3  General  Case 

Suppose we need to estimate a vector of unknown parameters $B = \{B_1, B_2,
\ldots, B_p\}$ from a sample of size $n$ based on observations $Y_1, \ldots, Y_n$. Denote the
probability or density function of the random variable $Y$ for the $i$th
observation by $f_i(y; B)$. The likelihood for the $i$th observation is $L_i(B) = f_i(Y_i; B)$.
In the one-sample binary response case, recall that $L_i(B) = L_i(P) =
P^{Y_i}[1 - P]^{1 - Y_i}$. The likelihood function, or joint likelihood of the sample,
is given by

$$L(B) = \prod_{i=1}^{n} f_i(Y_i; B). \qquad (9.20)$$

The log likelihood function is

$$\log L(B) = \sum_{i=1}^{n} \log L_i(B). \qquad (9.21)$$

The MLE of $B$ is that value of the vector $B$ that maximizes $\log L(B)$ as
a function of $B$. In general, the solution for $B$ requires iterative
trial-and-error methods as outlined later. Denote the MLE of $B$ as $b = \{b_1, \ldots, b_p\}$.
The score vector is the vector of first derivatives of $\log L(B)$ with respect to
$B_1, \ldots, B_p$:

$$\begin{aligned}
U(B) &= \{\partial/\partial B_1 \log L(B), \ldots, \partial/\partial B_p \log L(B)\} \\
     &= (\partial/\partial B) \log L(B). \qquad (9.22)
\end{aligned}$$

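The Fisher information matrix is the $p \times p$ matrix whose elements are the
negative of the expectation of all second partial derivatives of $\log L(B)$:

$$\begin{aligned}
I^*(B) &= -\{E[\partial^2 \log L(B)/\partial B_j \partial B_k]\}_{p \times p} \\
       &= -E\{(\partial^2/\partial B\,\partial B') \log L(B)\}. \qquad (9.23)
\end{aligned}$$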
The observed information matrix $I(B)$ is $I^*(B)$ without taking the
expectation. In other words, observed values remain in the second derivatives:

$$I(B) = -(\partial^2/\partial B\,\partial B') \log L(B). \qquad (9.24)$$

This information matrix is often estimated from the sample using the
estimated observed information $I(b)$, by inserting $b$, the MLE of $B$, into the
formula for $I(B)$.

Under suitable conditions, which are satisfied for most situations likely
to be encountered, the MLE $b$ for large samples is an optimal estimator
(has as great a chance of being close to the true parameter as all other
types of estimators) and has an approximate multivariate normal distribution
with mean vector $B$ and variance-covariance matrix $I^{*-1}(B)$, where $C^{-1}$
denotes the inverse of the matrix $C$. ($C^{-1}$ is the matrix such that $C^{-1}C$ is
the identity matrix, a matrix with ones on the diagonal and zeros elsewhere.
If $C$ is a $1 \times 1$ matrix, $C^{-1} = 1/C$.) A consistent estimator of the
variance-covariance matrix is given by the matrix $V$, obtained by inserting $b$ for $B$ in
$I(B)$: $V = I^{-1}(b)$.
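
To make the relationship $V = I^{-1}(b)$ concrete, the sketch below maximizes a binary logistic log likelihood numerically and inverts the Hessian of $-\log L$ at the optimum. It is a base R illustration on simulated data, not the fitting algorithm used by any particular package; the logistic model, the seed, and all variable names are assumptions. The result is compared with glm(), which should agree closely.

```r
set.seed(2)                                   # arbitrary seed
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))           # simulated binary responses
X <- cbind(Intercept = 1, x = x)              # design matrix

# Negative log likelihood and its gradient (the negative score vector)
negloglik <- function(B) {
  eta <- drop(X %*% B)
  -sum(y * eta - log1p(exp(eta)))
}
negscore <- function(B) {
  p <- plogis(drop(X %*% B))
  -drop(t(X) %*% (y - p))
}

opt <- optim(c(0, 0), negloglik, gr = negscore, method = "BFGS",
             hessian = TRUE, control = list(reltol = 1e-12))

b   <- opt$par          # MLE
I_b <- opt$hessian      # Hessian of -log L at b: estimated observed information
V   <- solve(I_b)       # estimated variance-covariance matrix
se  <- sqrt(diag(V))

fit <- glm(y ~ x, family = binomial)
print(cbind(optim = b,  glm = coef(fit)))
print(cbind(optim = se, glm = sqrt(diag(vcov(fit)))))
```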


9.3.1  Global  Test  Statistics 

Suppose we wish to test the null hypothesis $H_0: B = B^0$. The likelihood
ratio test statistic is

$$\begin{aligned}
\text{LR} &= -2 \log(L \text{ at } H_0 / L \text{ at MLEs}) \\
          &= -2[\log L(B^0) - \log L(b)]. \qquad (9.25)
\end{aligned}$$

The corresponding Wald test statistic, using the estimated observed
information matrix, is

$$W = (b - B^0)' I(b) (b - B^0) = (b - B^0)' V^{-1} (b - B^0). \qquad (9.26)$$

(A quadratic form $a'Va$ is a matrix generalization of $a^2V$.) Note that if the
number of estimated parameters is $p = 1$, $W$ reduces to $(b - B^0)^2/V$, which
is the square of a z- or t-type statistic (estimate minus hypothesized value, divided
by estimated standard deviation of estimate).

The score statistic for $H_0$ is

$$S = U'(B^0) I^{-1}(B^0) U(B^0). \qquad (9.27)$$

Note that as before, $S$ does not require solving for the MLE. For large samples,
LR, $W$, and $S$ have a $\chi^2$ distribution with $p$ d.f. under suitable conditions.
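
The three global statistics can be computed side by side for a small example. The base R sketch below assumes a binary logistic model with an intercept and one slope fitted to simulated data, and tests $H_0: B = (0, 0)$; the model, seed, and helper names are illustrative assumptions. For this model the observed and expected information coincide.

```r
set.seed(3)                                   # arbitrary seed
n <- 150
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 0.8 * x))      # simulated binary responses
X <- cbind(Intercept = 1, x = x)

loglik <- function(B) { eta <- drop(X %*% B); sum(y * eta - log1p(exp(eta))) }
score  <- function(B) drop(t(X) %*% (y - plogis(drop(X %*% B))))
info   <- function(B) { p <- plogis(drop(X %*% B)); t(X) %*% (X * (p * (1 - p))) }

b  <- coef(glm(y ~ x, family = binomial))     # MLE of B
B0 <- c(0, 0)                                 # hypothesized values under H0

LR <- -2 * (loglik(B0) - loglik(b))                          # Eq. 9.25
W  <- drop(t(b - B0) %*% info(b) %*% (b - B0))               # Eq. 9.26
S  <- drop(t(score(B0)) %*% solve(info(B0)) %*% score(B0))   # Eq. 9.27

stats <- c(LR = LR, Wald = W, Score = S)
print(round(rbind(chisq = stats,
                  p = pchisq(stats, df = 2, lower.tail = FALSE)), 4))
```

Only LR and $W$ use the fitted coefficients; $S$ is evaluated entirely at $B^0$.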


9.3.2 Testing a Subset of the Parameters

Let $B = \{B_1, B_2\}$ and suppose that we wish to test $H_0: B_1 = B_1^0$. We
are treating $B_2$ as a nuisance parameter. For example, we may want to test
whether blood pressure and cholesterol are risk factors after adjusting for
confounders age and sex. In that case $B_1$ is the pair of regression coefficients
for blood pressure and cholesterol and $B_2$ is the pair of coefficients for age
and sex. $B_2$ must be estimated to allow adjustment for age and sex, although
$B_2$ is a nuisance parameter and is not of primary interest.

Let the number of parameters of interest be $k$ so that $B_1$ is a vector of
length $k$. Let the number of “nuisance” or “adjustment” parameters be $q$, the
length of $B_2$ (note $k + q = p$).

Let $b_2^*$ be the MLE of $B_2$ under the restriction that $B_1 = B_1^0$. Then the
likelihood ratio statistic is

$$\text{LR} = -2[\log L \text{ at } H_0 - \log L \text{ at MLE}]. \qquad (9.28)$$

Now $\log L$ at $H_0$ is more complex than before because $H_0$ involves an unknown
nuisance parameter $B_2$ that must be estimated. $\log L$ at $H_0$ is the maximum
of the log likelihood function for any value of $B_2$ but subject to the condition
that $B_1 = B_1^0$. Thus

$$\text{LR} = -2[\log L(B_1^0, b_2^*) - \log L(b)], \qquad (9.29)$$

where as before $b$ is the overall MLE of $B$. Note that LR requires
maximizing two log likelihood functions. The first component of LR is a restricted
maximum likelihood and the second component is the overall or unrestricted
maximum.

LR is often computed by examining successively more complex models in
a stepwise fashion and calculating the increment in likelihood ratio $\chi^2$ in the
overall model. The LR $\chi^2$ for testing $H_0: B_2 = 0$ when $B_1$ is not in the
model is

$$\text{LR}(H_0: B_2 = 0 \mid B_1 = 0) = -2[\log L(0, 0) - \log L(0, b_2^*)]. \qquad (9.30)$$

Here we are specifying that $B_1$ is not in the model by setting $B_1 = B_1^0 = 0$,
and we are testing $H_0: B_2 = 0$. (We are also ignoring nuisance parameters
such as an intercept term in the test for $B_2 = 0$.)

The LR $\chi^2$ for testing $H_0: B_1 = B_2 = 0$ is given by

$$\text{LR}(H_0: B_1 = B_2 = 0) = -2[\log L(0, 0) - \log L(b)]. \qquad (9.31)$$

Subtracting LR $\chi^2$ for the smaller model from that of the larger model yields

$$-2[\log L(0, 0) - \log L(b)] - \{-2[\log L(0, 0) - \log L(0, b_2^*)]\}
  = -2[\log L(0, b_2^*) - \log L(b)], \qquad (9.32)$$

which is the same as above (letting $B_1^0 = 0$).

Table 9.1 Example tests

Variables (Parameters) in Model    LR $\chi^2$    Number of Parameters
Intercept, age                     1000           2
Intercept, age, age$^2$            1010           3
Intercept, age, age$^2$, sex       1013           4

For example, suppose successively larger models yield the LR $\chi^2$s in
Table 9.1. The LR $\chi^2$ for testing for linearity in age (not adjusting for sex)
against quadratic alternatives is $1010 - 1000 = 10$ with 1 d.f. The LR $\chi^2$
for testing the added information provided by sex, adjusting for a quadratic
effect of age, is $1013 - 1010 = 3$ with 1 d.f. The LR $\chi^2$ for testing the joint
importance of sex and the nonlinear (quadratic) effect of age is $1013 - 1000 = 13$
with 2 d.f.
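
The arithmetic for Table 9.1 can be laid out in a few lines of base R, with P-values added from the $\chi^2$ distribution. The P-values are not given in the text; they follow from the stated statistics and degrees of freedom. Variable names are illustrative.

```r
lr   <- c(1000, 1010, 1013)   # LR chi-square for the three models in Table 9.1
npar <- c(2, 3, 4)            # number of parameters in each model

tests <- data.frame(
  test  = c("nonlinearity in age",
            "sex, adjusted for quadratic age",
            "sex and nonlinear age jointly"),
  chisq = c(lr[2] - lr[1], lr[3] - lr[2], lr[3] - lr[1]),
  df    = c(npar[2] - npar[1], npar[3] - npar[2], npar[3] - npar[1])
)
tests$p <- pchisq(tests$chisq, tests$df, lower.tail = FALSE)

print(tests)
```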

To derive the Wald statistic for testing $H_0: B_1 = B_1^0$ with $B_2$ being a
nuisance parameter, let the MLE $b$ be partitioned into $b = \{b_1, b_2\}$. We can
likewise partition the estimated variance-covariance matrix $V$ into

$$V = \begin{bmatrix} V_{11} & V_{12} \\ V_{12}' & V_{22} \end{bmatrix}. \qquad (9.33)$$

The Wald statistic is

$$W = (b_1 - B_1^0)' V_{11}^{-1} (b_1 - B_1^0), \qquad (9.34)$$

which when $k = 1$ reduces to (estimate minus hypothesized value)$^2$ / estimated
variance, with the estimates adjusted for the parameters in $B_2$.

The score statistic for testing $H_0: B_1 = B_1^0$ does not require solving for
the full set of unknown parameters. Only the MLEs of $B_2$ must be computed,
under the restriction that $B_1 = B_1^0$. This restricted MLE is $b_2^*$ from above.
Let $U(B_1^0, b_2^*)$ denote the vector of first derivatives of $\log L$ with respect to
all parameters in $B$, evaluated at the hypothesized parameter values $B_1^0$ for
the first $k$ parameters and at the restricted MLE $b_2^*$ for the last $q$ parameters.
(Since the last $q$ estimates are MLEs, the last $q$ elements of $U$ are zero, so
the formulas that follow simplify.) Let $I(B_1^0, b_2^*)$ be the observed information
matrix evaluated at the same values of $B$ as is $U$. The score statistic for
testing $H_0: B_1 = B_1^0$ is

$$S = U'(B_1^0, b_2^*) I^{-1}(B_1^0, b_2^*) U(B_1^0, b_2^*). \qquad (9.35)$$

Under suitable conditions, the distribution of LR, $W$, and $S$ can be
adequately approximated by a $\chi^2$ distribution with $k$ d.f.
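
The following base R sketch illustrates the three subset tests for a single parameter of interest ($k = 1$) adjusted for one covariate. The logistic model, the simulated data, and the names x1, x2, full, and red are assumptions for illustration only. The score statistic is built from the restricted fit alone, as described above.

```r
set.seed(4)                              # arbitrary seed
n  <- 300
x1 <- rnorm(n)                           # predictor of interest (B1, k = 1)
x2 <- rnorm(n)                           # adjustment variable (part of B2)
y  <- rbinom(n, 1, plogis(-0.2 + 0.5 * x1 + 0.7 * x2))

full <- glm(y ~ x1 + x2, family = binomial)   # unrestricted MLE b
red  <- glm(y ~ x2,      family = binomial)   # restricted MLE b2* with B1 = 0

# LR (Eq. 9.29): for binary data the deviance is -2 log L
LR <- deviance(red) - deviance(full)

# Wald (Eq. 9.34) with k = 1: b1^2 / V11
W <- unname(coef(full)["x1"]^2 / vcov(full)["x1", "x1"])

# Score (Eq. 9.35): U and I evaluated at (B1 = 0, b2*)
X  <- cbind(x1 = x1, Intercept = 1, x2 = x2)
p0 <- fitted(red)                        # fitted probabilities under H0
U  <- drop(t(X) %*% (y - p0))            # elements for B2 are approximately 0
I0 <- t(X) %*% (X * (p0 * (1 - p0)))
S  <- drop(t(U) %*% solve(I0) %*% U)

stats <- c(LR = LR, Wald = W, Score = S)
print(round(rbind(chisq = stats,
                  p = pchisq(stats, df = 1, lower.tail = FALSE)), 4))
```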


9.3.3 Tests Based on Contrasts

Wald tests are also done by setting up a general linear contrast. $H_0: CB = 0$
is tested by a Wald statistic of the form

$$W = (Cb)'(CVC')^{-1}(Cb), \qquad (9.36)$$

where $C$ is a contrast matrix that “picks off” the proper elements of $B$. The
contrasts can be much more general by allowing elements of $C$ to be other
than zero and one. For the normal linear model, $W$ is converted to an
F-statistic by dividing by the rank $r$ of $C$ (normally the number of rows in
$C$), yielding a statistic with an F-distribution with $r$ numerator degrees of
freedom.

Many interesting contrasts are tested by forming differences in predicted
values. By forming more contrasts than are really needed, one can develop
a surprisingly flexible approach to hypothesis testing using predicted values.
This has the major advantage of not requiring the analyst to account for how
the predictors are coded. Suppose that one wanted to assess the difference
in two vectors of predicted values, $X_1 b - X_2 b = (X_1 - X_2)b = \Delta b$, to test
$H_0: \Delta B = 0$, where $\Delta = X_1 - X_2$. The covariance matrix for $\Delta b$ is given by

$$\text{var}(\Delta b) = \Delta V \Delta'. \qquad (9.37)$$

Let $r$ be the rank of $\text{var}(\Delta b)$, i.e., the number of non-linearly-dependent
(non-redundant) differences of predicted values of $\Delta$. The value of $r$ and the
rows of $\Delta$ that are not redundant may easily be determined using the QR
decomposition as done by the R function qr^b. The $\chi^2$ statistic with $r$ degrees
of freedom (or F-statistic upon dividing the statistic by $r$) may be obtained by
computing $(\Delta^* b)'(\Delta^* V \Delta^{*\prime})^{-1}(\Delta^* b)$, where $\Delta^*$ is the subset of rows of $\Delta$
corresponding to non-redundant contrasts, so that $\Delta^* V \Delta^{*\prime}$ is the
corresponding sub-matrix of $\text{var}(\Delta b)$.

The “difference in predictions” approach can be used to compare means
in a 30 year old male with a 40 year old female^c. But the true utility of
the approach is most obvious when the contrast involves multiple nonlinear
terms for a single predictor, e.g., a spline function. To test for a difference
in two curves, one can compare predictions at one predictor value against
predictions at a series of values with at least one value that pertains to each
basis function. Points can be placed between every pair of knots and beyond
the outer knots, or just obtain predictions at 100 equally spaced $X$-values.


b For example, in a 3-treatment comparison one could examine contrasts between
treatments A and B, A and C, and B and C by obtaining predicted values for those
treatments, even though only two differences are required.

c The rms command could be contrast(fit, list(sex='male', age=30),
list(sex='female', age=40)) where all other predictors are set to medians or
modes.


Suppose that there are three treatment groups (A, B, C) interacting with a
cubic spline function of $X$. If one wants to test the multiple degree of freedom
hypothesis that the profile for $X$ is the same for treatment A and B vs. the
alternative hypothesis that there is a difference between A and B for at least
one value of $X$, one can compare predicted values at treatment A and a vector
of $X$ values against predicted values at treatment B and the same vector of
$X$ values. If the $X$ relationship is linear, any two $X$ values will suffice, and
if $X$ is quadratic, any three points will suffice. It would be difficult to test
complex hypotheses involving only 2 of 3 treatments using other methods.

The contrast function in rms can estimate a wide variety of contrasts and make joint tests involving them, automatically computing the number of non-linearly-dependent contrasts as the test’s degrees of freedom. See its help file for several examples.
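
The difference-in-predictions test can be sketched in base R. This is only an illustration under stated assumptions, not the rms implementation: the simulated data, the names trt, x, and fit, and the use of a quadratic rather than a spline in X are all hypothetical, and the Wald statistic is formed here as the quadratic form in the inverse of the estimated variance of the non-redundant contrasts.

```r
set.seed(1)
n   <- 300
trt <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x   <- runif(n, 0, 10)
y   <- rbinom(n, 1, plogis(-1 + 0.2 * x + 0.5 * (trt == "B")))
fit <- glm(y ~ trt * (x + I(x^2)), family = binomial)

# Design matrices for treatments A and B at the same grid of x values
xs <- seq(1, 9, length.out = 10)
dA <- data.frame(trt = factor("A", levels = levels(trt)), x = xs)
dB <- data.frame(trt = factor("B", levels = levels(trt)), x = xs)
tt <- delete.response(terms(fit))
Delta <- model.matrix(tt, dB) - model.matrix(tt, dA)   # 10 contrasts (rows)

# Pivoted QR of t(Delta) identifies the non-redundant rows of Delta
q    <- qr(t(Delta))
r    <- q$rank
keep <- q$pivot[seq_len(r)]
D    <- Delta[keep, , drop = FALSE]

# Wald chi-square with r d.f. for "A and B profiles are identical"
b     <- coef(fit)
V     <- vcov(fit)
est   <- D %*% b
chisq <- drop(t(est) %*% solve(D %*% V %*% t(D)) %*% est)
c(chisq = chisq, d.f. = r, P = pchisq(chisq, r, lower.tail = FALSE))
```

Ten rows are requested but only three are non-redundant, because the B versus A difference in a quadratic model involves three parameters (the treatment main effect and its interactions with the linear and quadratic terms).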


9.3.4  Which  Test  Statistics  to  Use  When 

At this point, one may ask why three types of test statistics are needed. The answer lies in the statistical properties of the three tests as well as in computational expense in different situations. From the standpoint of statistical properties, LR is the best statistic, followed by S and W. The major statistical problem with W is that it is sensitive to problems in the estimated variance-covariance matrix in the full model. For some models, most notably the logistic regression model [278], the variance-covariance estimates can be too large as the effects in the model become very strong, resulting in values of W that are too small (or significance levels that are too large). W is also sensitive to the way the parameter appears in the model. For example, a test of $H_0$: log odds ratio = 0 will yield a different value of W than will $H_0$: odds ratio = 1.
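
The dependence of W on parameterization can be seen numerically. The sketch below uses a hypothetical 2 x 2 table (the counts are invented for illustration), the usual large-sample variance of the log odds ratio, and a delta-method variance for the odds ratio itself; both choices are assumptions of this example.

```r
# Hypothetical 2 x 2 table: successes and failures in two groups
s1 <- 15; f1 <- 5     # group 1
s2 <- 8;  f2 <- 12    # group 2

log_or    <- log((s1 / f1) / (s2 / f2))
var_logor <- 1 / s1 + 1 / f1 + 1 / s2 + 1 / f2   # large-sample variance of log OR

# Wald statistic for H0: log odds ratio = 0
W_log <- log_or^2 / var_logor

# Wald statistic for H0: odds ratio = 1, variance by the delta method
or     <- exp(log_or)
var_or <- or^2 * var_logor
W_or   <- (or - 1)^2 / var_or

c(W_log = W_log, W_or = W_or)
```

The two hypotheses are identical, yet the two Wald statistics differ, whereas the LR statistic would be unchanged by the reparameterization.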

Relative computational efficiency of the three types of tests is also an issue. Computation of LR and W requires estimating all p unknown parameters, and in addition LR requires re-estimating the last q parameters under the restriction that the first k parameters = $\beta^0$. Therefore, when one is contemplating whether a set of parameters should be added to a model, the score test is the easiest test to carry out. For example, if one were interested in testing all two-way interactions among 4 predictors, the score test statistic for $H_0$: “no interactions present” could be computed without estimating the 4 × 3/2 = 6 interaction effects. S would also be appealing for testing linearity of effects in a model: the nonlinear spline terms could be tested for significance after adjusting for the linear effects (with estimation of only the linear effects). Only parameters for linear effects must be estimated to compute S, resulting in fewer numerical problems such as lack of convergence of the Newton-Raphson algorithm.

Table 9.2 Choice of test statistics

Type of Test                          Recommended Test Statistic
Global association                    LR (S for large no. parameters)
Partial association                   W (LR or S if problem with W)
Lack of fit, 1 d.f.                   W or S
Lack of fit, > 1 d.f.                 S
Inclusion of additional predictors    S

The Wald tests are very easy to make after all the parameters in a model have been estimated. Wald tests are thus appealing in a multiple regression setup when one wants to test whether a given predictor or set of predictors is “significant.” A score test would require re-estimating the regression coefficients under the restriction that the parameters of interest equal zero.

Likelihood ratio tests are used often for testing the global hypothesis that no effects are significant, as the log likelihood evaluated at the MLEs is already available from fitting the model and the log likelihood evaluated at a “null model” (e.g., a model containing only an intercept) is often easy to compute. Likelihood ratio tests should also be used when the validity of a Wald test is in question as in the example cited above.

Table 9.2 summarizes recommendations for choice of test statistics for various situations.


9.3.5  Example:  Binomial — Comparing  Two 
Proportions 

Suppose that a binary random variable $Y_1$ represents responses for population 1 and $Y_2$ represents responses for population 2. Let $P_i = \mathrm{Prob}\{Y_i = 1\}$ and assume that a random sample has been drawn from each population with respective sample sizes $n_1$ and $n_2$. The sample values are denoted by $Y_{i1}, \ldots, Y_{in_i}$, $i = 1$ or 2. Let

$$ s_1 = \sum_{j=1}^{n_1} Y_{1j}, \qquad s_2 = \sum_{j=1}^{n_2} Y_{2j}, \tag{9.38} $$

the respective observed number of “successes” in the two samples. Let us test the null hypothesis $H_0: P_1 = P_2$ based on the two samples.

The likelihood function is

$$ L = \prod_{i=1}^{2} P_i^{s_i} (1 - P_i)^{n_i - s_i} \tag{9.39} $$

$$ \log L = \sum_{i=1}^{2} \{ s_i \log(P_i) + (n_i - s_i) \log(1 - P_i) \}. \tag{9.40} $$

Under $H_0$, $P_1 = P_2 = P$, so

$$ \log L(H_0) = s \log(P) + (n - s) \log(1 - P), \tag{9.41} $$

where $s = s_1 + s_2$, $n = n_1 + n_2$. The (restricted) MLE of this common $P$ is $p = s/n$ and $\log L$ at this value is $s \log(p) + (n - s) \log(1 - p)$.

Since the original unrestricted log likelihood function contains two terms with separate parameters, the two parts may be maximized separately giving MLEs

$$ p_1 = s_1/n_1 \quad \text{and} \quad p_2 = s_2/n_2. \tag{9.42} $$

$\log L$ evaluated at these (unrestricted) MLEs is

$$ \log L = s_1 \log(p_1) + (n_1 - s_1) \log(1 - p_1) + s_2 \log(p_2) + (n_2 - s_2) \log(1 - p_2). \tag{9.43} $$

The likelihood ratio statistic for testing $H_0: P_1 = P_2$ is then

$$ LR = -2 \{ s \log(p) + (n - s) \log(1 - p) - [ s_1 \log(p_1) + (n_1 - s_1) \log(1 - p_1) + s_2 \log(p_2) + (n_2 - s_2) \log(1 - p_2) ] \}. \tag{9.44} $$

This statistic for large enough $n_1$ and $n_2$ has a $\chi^2$ distribution with 1 d.f. since the null hypothesis involves the estimation of one fewer parameter than does the unrestricted case. This LR statistic is the likelihood ratio $\chi^2$ statistic for a 2 × 2 contingency table. It can be shown that the corresponding score statistic is equivalent to the Pearson $\chi^2$ statistic. The better LR statistic can be used routinely over the Pearson $\chi^2$ for testing hypotheses in contingency tables.
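
As a worked illustration of Equation 9.44, the following sketch computes the LR statistic and the score (Pearson) statistic for a hypothetical pair of samples. The counts and the helper name ll are assumptions made for this example.

```r
# Hypothetical data: s successes out of n in each sample
s1 <- 15; n1 <- 20
s2 <- 8;  n2 <- 20
s  <- s1 + s2; n <- n1 + n2
p1 <- s1 / n1; p2 <- s2 / n2; p <- s / n

# Binomial log likelihood for s successes in n trials at probability p
ll <- function(s, n, p) s * log(p) + (n - s) * log(1 - p)

# Likelihood ratio statistic, Equation 9.44
LR <- -2 * (ll(s, n, p) - (ll(s1, n1, p1) + ll(s2, n2, p2)))

# Score statistic, which equals the Pearson chi-square for the 2 x 2 table
S <- (p1 - p2)^2 / (p * (1 - p) * (1 / n1 + 1 / n2))

rbind(statistic = c(LR = LR, S = S),
      P = pchisq(c(LR, S), df = 1, lower.tail = FALSE))
```

The score statistic can be checked against chisq.test applied to the 2 × 2 table with correct = FALSE.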


9.4  Iterative  ML  Estimation 

In most cases, one cannot explicitly solve for MLEs but must use trial-and-error numerical methods to solve for parameter values $\beta$ that maximize $\log L(\beta)$ or yield a score vector $U(\beta) = 0$. One of the fastest and most applicable methods for maximizing a function is the Newton-Raphson method, which is based on approximating $U(\beta)$ by a linear function of $\beta$ in a small region. A starting estimate $b^0$ of the MLE $b$ is made. The linear approximation (a first-order Taylor series approximation)

$$ U(b) = U(b^0) - I(b^0)(b - b^0) \tag{9.45} $$

is equated to 0 and solved for $b$ yielding

$$ b = b^0 + I^{-1}(b^0) U(b^0). \tag{9.46} $$

The process is continued in like fashion. At the $i$th step the next estimate is obtained from the previous estimate using the formula

$$ b^{i+1} = b^i + I^{-1}(b^i) U(b^i). \tag{9.47} $$

If the log likelihood actually worsened at $b^{i+1}$, “step halving” is used; $b^{i+1}$ is replaced with $(b^i + b^{i+1})/2$. Further step halving is done if the log likelihood still is worse than the log likelihood at $b^i$, after which the original iterative strategy is resumed. The Newton-Raphson iterations continue until the -2 log likelihood changes by only a small amount over the previous iteration (say .025). The reasoning behind this stopping rule is that estimates of $\beta$ that change the -2 log likelihood by less than this amount do not affect statistical inference since -2 log likelihood is on the $\chi^2$ scale.
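
The iteration in Equation 9.47, with step halving and the stopping rule just described, can be sketched for a binary logistic model. The simulated data and the function names loglik, score, and info are assumptions of this example; production fitting functions add further safeguards not shown here.

```r
set.seed(2)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))                 # design matrix with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, 0.5)))

loglik <- function(b) { eta <- drop(X %*% b); sum(y * eta - log1p(exp(eta))) }
score  <- function(b) drop(t(X) %*% (y - plogis(drop(X %*% b))))     # U(b)
info   <- function(b) {                                              # I(b)
  p <- plogis(drop(X %*% b))
  t(X) %*% (X * (p * (1 - p)))
}

b <- rep(0, ncol(X))                              # starting estimate b^0
repeat {
  old  <- loglik(b)
  bnew <- b + drop(solve(info(b), score(b)))      # Equation 9.47
  # Step halving: back off toward b while the log likelihood is worse
  while (loglik(bnew) < old) bnew <- (b + bnew) / 2
  b <- bnew
  # Stop when -2 log likelihood changes by less than 0.025
  if (abs(2 * (loglik(b) - old)) < 0.025) break
}
b
```

With well-behaved data such as these, the estimates should agree closely with those from glm(y ~ X - 1, family = binomial).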


9.5  Robust  Estimation  of  the  Covariance  Matrix 


The estimator for the covariance matrix of $b$ found in Section 9.3 assumes that the model is correctly specified in terms of distribution, regression assumptions, and independence assumptions. The model may be incorrect in a variety of ways such as non-independence (e.g., repeated measurements within subjects), lack of fit (e.g., omitted covariable, incorrect covariable transformation, omitted interaction), and distributional misspecification (e.g., $Y$ has a $\Gamma$ distribution instead of a normal distribution). Variances and covariances, and hence confidence intervals and Wald tests, will be incorrect when these assumptions are violated.

For the case in which the observations are independent and identically distributed but other assumptions are possibly violated, Huber [312] provided a covariance matrix estimator that is consistent. His “sandwich” estimator is given by

$$ H = I^{-1}(b) \left[ \sum_{i=1}^{n} U_i U_i' \right] I^{-1}(b), \tag{9.48} $$

where $I(b)$ is the observed information matrix (Equation 9.24) and $U_i$ is the vector of derivatives, with respect to all parameters, of the log likelihood component for the $i$th observation (assuming the log likelihood can be partitioned into per-observation contributions). For the normal multiple linear regression case, $H$ was derived by White [659]:

$$ (X'X)^{-1} \left[ \sum_{i=1}^{n} (Y_i - X_i b)^2 X_i X_i' \right] (X'X)^{-1}, \tag{9.49} $$

where $X$ is the design matrix (including an intercept if appropriate) and $X_i$ is the vector of predictors (including an intercept) for the $i$th observation. This covariance estimator allows for any pattern of variances of $Y|X$ across observations. Note that even though $H$ improves the bias of the covariance matrix of $b$, it may actually have larger mean squared error than the ordinary estimate in some cases due to increased variance [164, 529].

When observations are dependent within clusters, and the number of observations within clusters is very small in comparison to the total sample size, a simple adjustment to Equation 9.48 can be used to derive appropriate covariance matrix estimates (see Lin [407, p. 2237], Rogers [529], and Lee et al. [393, Eq. 5.1, p. 246]). One merely accumulates sums of elements of $U$ within clusters before computing cross-product terms:

$$ H_c = I^{-1}(b) \left[ \sum_{i=1}^{c} \left\{ \left( \sum_{j=1}^{n_i} U_{ij} \right) \left( \sum_{j=1}^{n_i} U_{ij} \right)' \right\} \right] I^{-1}(b), \tag{9.50} $$

where $c$ is the number of clusters, $n_i$ is the number of observations in the $i$th cluster, $U_{ij}$ is the contribution of the $j$th observation within the $i$th cluster to the score vector, and $I(b)$ is computed as before ignoring clusters. For a model such as the Cox model which has no per-observation score contributions, special score residuals [393, 407, 410, 605] are used for $U$.
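
For ordinary least squares, Equations 9.49 and 9.50 can be computed directly in a few lines. The sketch below is an illustration on simulated data with hypothetical cluster identifiers id; it omits the small-sample corrections that packaged implementations may apply.

```r
set.seed(3)
n  <- 120
id <- rep(1:30, each = 4)                      # 30 hypothetical clusters of size 4
x  <- rnorm(n)
# Cluster random effect plus variance that depends on x
y  <- 1 + 2 * x + rnorm(30)[id] + rnorm(n) * (1 + abs(x))
fit <- lm(y ~ x)

X     <- model.matrix(fit)
U     <- X * resid(fit)                        # row i is (Y_i - X_i b) X_i
bread <- solve(crossprod(X))                   # (X'X)^{-1}

H  <- bread %*% crossprod(U) %*% bread         # Equation 9.49
Uc <- rowsum(U, id)                            # sum contributions within clusters
Hc <- bread %*% crossprod(Uc) %*% bread        # Equation 9.50

# Standard errors: model-based, sandwich, and cluster sandwich
sqrt(cbind(model = diag(vcov(fit)), H = diag(H), Hc = diag(Hc)))
```

When every observation forms its own cluster, the cluster version reduces to Equation 9.49, which provides a simple check of the computation.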

Bootstrapping  can  also  be  used  to  derive  robust  covariance  matrix  esti¬ 
mates177, 1/1  in  many  cases,  especially  if  covariances  of  b  that  are  not  condi¬ 
tional  on  X  are  appropriate.  One  merely  generates  approximately  200  samples 
with  replacement  from  the  original  dataset,  computes  200  sets  of  parameter 
estimates,  and  computes  the  sample  covariance  matrix  of  these  parameter  es¬ 
timates.  Sampling  with  replacement  from  entire  clusters  can  be  used  to  derive 
variance  estimates  in  the  presence  of  intracluster  correlation.188  Bootstrap 
estimates  of  the  conditional  variance-covariance  matrix  given  X  are  harder 
to  obtain  and  depend  on  the  model  assumptions  being  satisfied.  The  simpler 
unconditional  estimates  may  be  more  appropriate  for  many  non-experimental 
studies  where  one  may  desire  to  “penalize”  for  the  X  being  random  variables. 
It  is  interesting  that  these  unconditional  estimates  may  be  very  difficult  to  ob¬ 
tain  parametrically,  since  a  multivariate  distribution  may  need  to  be  assumed 
for  X. 
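
The unconditional bootstrap covariance estimate described above can be sketched as follows. The data are simulated, and the names d, B, and boot_coef are assumptions of this example; for clustered data one would resample whole clusters rather than rows.

```r
set.seed(4)
n <- 150
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, family = binomial, data = d)

B <- 200
boot_coef <- t(replicate(B, {
  i <- sample(n, replace = TRUE)               # resample rows with replacement
  coef(glm(y ~ x, family = binomial, data = d[i, ]))
}))
V_boot <- cov(boot_coef)                       # sample covariance of B estimates

# Compare information-based and bootstrap standard errors
cbind(information = sqrt(diag(vcov(fit))), bootstrap = sqrt(diag(V_boot)))
```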

The previous discussion addresses the use of a “working independence model” with clustered data. Here one estimates regression coefficients assuming independence of all records (observations). Then a sandwich or bootstrap method is used to increase standard errors to reflect some redundancy in the correlated observations. The parameter estimates will often be consistent estimates of the true parameter values, but they may be inefficient for certain cluster or correlation structures.

The rms package’s robcov function computes the Huber robust covariance matrix estimator, and the bootcov function computes the bootstrap covariance estimator. Both of these functions allow for clustering.


9.6  Wald,  Score,  and  Likelihood-Based  Confidence 
Intervals 

A $1 - \alpha$ confidence interval for a parameter $\beta_i$ is the set of all values $\beta_i^0$ that if hypothesized would be accepted in a test of $H_0: \beta_i = \beta_i^0$ at the $\alpha$ level. What test should form the basis for the confidence interval? The Wald test is most frequently used because of its simplicity. A two-sided $1 - \alpha$ confidence interval is $b_i \pm z_{1-\alpha/2} s$, where $z$ is the critical value from the normal distribution and $s$ is the estimated standard error of the parameter estimate $b_i$ (footnote d). The problem with $s$ discussed in Section 9.3.4 points out that Wald statistics may not always be a good basis. Wald-based confidence intervals are also symmetric even though the coverage probability may not be [160]. Score- and LR-based confidence limits have definite advantages. When Wald-type confidence intervals are appropriate, the analyst may consider insertion of robust covariance estimates (Section 9.5) into the confidence interval formulas (note that adjustments for heterogeneity and correlated observations are not available for score and LR statistics).

Wald- (asymptotic normality) based statistics are convenient for deriving confidence intervals for linear or more complex combinations of the model’s parameters. As in Equation 9.36, the variance-covariance matrix of $Cb$, where $C$ is an appropriate matrix and $b$ is the vector of parameter estimates, is $CVC'$, where $V$ is the variance matrix of $b$. In regression models we commonly substitute a vector of predictors (and optional intercept) for $C$ to obtain the variance of the linear predictor $Xb$ as

$$ \mathrm{var}(Xb) = X V X'. \tag{9.51} $$

See Section 9.3.3 for related information.
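
Equation 9.51 and the resulting Wald confidence limits for the linear predictor can be illustrated in base R. The data are simulated and the two covariate settings (ages 40 and 60) are hypothetical.

```r
set.seed(5)
n <- 100
d <- data.frame(age = runif(n, 30, 70))
d$y <- rbinom(n, 1, plogis(-3 + 0.05 * d$age))
fit <- glm(y ~ age, family = binomial, data = d)

X  <- cbind(1, c(40, 60))                 # intercept and age for two settings
b  <- coef(fit)
V  <- vcov(fit)
lp <- drop(X %*% b)                       # linear predictor Xb (log odds)
se <- sqrt(diag(X %*% V %*% t(X)))        # Equation 9.51
z  <- qnorm(0.975)                        # critical value for 0.95 limits

cbind(lp, lower = lp - z * se, upper = lp + z * se)
```

The standard errors agree with those returned by predict(fit, newdata, se.fit = TRUE) on the link scale.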


d This is the basis for confidence limits computed by the R rms package’s Predict, summary, and contrast functions. When the robcov function has been used to replace the information-matrix-based covariance matrix with a Huber robust covariance estimate with an optional cluster sampling correction, the functions are using a “robust” Wald statistic basis. When the bootcov function has been used to replace the model fit’s covariance matrix with a bootstrap unconditional covariance matrix estimate, these functions are computing confidence limits based on a normal distribution but using more nonparametric covariance estimates.

9.6.1 Simultaneous Wald Confidence Regions

The confidence intervals just discussed are pointwise confidence intervals. For OLS regression there are methods for computing confidence intervals with exact simultaneous confidence coverage for multiple estimates [374]. There are approximate methods for simultaneous confidence limits for all models for which the vector of estimates $b$ is approximately multivariately normally distributed. The method of Hothorn et al. [307] is quite general; in their R package multcomp’s glht function, the user can specify any contrast matrix over which the individual confidence limits will be simultaneous. A special case of a contrast matrix is the design matrix $X$ itself, resulting in simultaneous confidence bands for any number of predicted values. An example is shown in Figure 9.5. See Section 9.3.3 for a good use for simultaneous contrasts.

A more nonparametric method for computing confidence intervals for functions of the vector of parameters $\beta$ can be based on bootstrap percentile confidence limits. For each sample with replacement from the original dataset, one computes the MLE of $\beta$, $b$, and then the quantity of interest $g(b)$. Then the $g$s are sorted and the desired quantiles are computed. At least 1000 bootstrap samples will be needed for accurate assessment of outer confidence limits. This method is suitable for obtaining pointwise confidence bands for a nonlinear regression function, say, the relationship between age and the log odds of disease. At each of 100 age values the predicted logits are computed for each bootstrap sample. Then separately for each age point the 0.025 and 0.975 quantiles of 1000 estimates of the logit are computed to derive a 0.95 confidence band. Other more complex bootstrap schemes will achieve somewhat greater accuracy of confidence interval coverage [178], and as described in Section 9.5 one can use variations on the basic bootstrap in which the predictors are considered fixed and/or cluster sampling is taken into account. The R function bootcov in the rms package bootstraps model fits to obtain unconditional (with respect to predictors) bootstrap distributions with or without cluster sampling. bootcov stores the matrix of bootstrap regression coefficients so that the bootstrapped quantities of interest can be computed in one sweep of the coefficient matrix once bootstrapping is completed.
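
The percentile-band recipe just described can be sketched without rms. The simulated age and disease data and the quadratic form of the model are assumptions of this example; the stored coefficient matrix plays the role that bootcov’s saved coefficients play.

```r
set.seed(6)
n <- 200
d <- data.frame(age = runif(n, 20, 80))
d$y <- rbinom(n, 1, plogis(-4 + 0.06 * d$age))

ages <- seq(20, 80, length.out = 100)          # 100 age values for the band
Xnew <- cbind(1, ages, ages^2)                 # design matrix at those ages

B <- 1000
boot_coef <- t(replicate(B, {
  i <- sample(n, replace = TRUE)
  coef(glm(y ~ age + I(age^2), family = binomial, data = d[i, ]))
}))

# One sweep of the coefficient matrix gives all bootstrap predicted logits
logits <- boot_coef %*% t(Xnew)                # B x 100
# Pointwise 0.025 and 0.975 quantiles form the 0.95 confidence band
band <- apply(logits, 2, quantile, probs = c(0.025, 0.975))
```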

For many regression models, the rms package’s Predict, summary, and contrast functions make it easy to compute pointwise bootstrap confidence intervals in a variety of contexts. As an example, consider 200 simulated x values from a log-normal distribution and simulate binary y from a true population binary logistic model given by

$$ \mathrm{Prob}(y = 1 \mid X = x) = \frac{1}{1 + \exp[-(1 + x/2)]}. \tag{9.52} $$

Not knowing the true model, a quadratic logistic model is fitted. The R code needed to generate the data and fit the model is given below.

require(rms)
n <- 200
set.seed(15)
x1 <- rnorm(n)
logit <- x1/2
y <- ifelse(runif(n) < plogis(logit), 1, 0)
dd <- datadist(x1); options(datadist='dd')
f <- lrm(y ~ pol(x1,2), x=TRUE, y=TRUE)
print(f, latex=TRUE)

Logistic Regression Model

lrm(formula = y ~ pol(x1, 2), x = TRUE, y = TRUE)

                      Model Likelihood     Discrimination     Rank Discrim.
                         Ratio Test           Indexes            Indexes
Obs             200   LR χ²       16.37    R²       0.105     C       0.642
 0               97   d.f.            2    g        0.680     Dxy     0.285
 1              103   Pr(> χ²)   0.0003    gr       1.973     γ       0.286
max |deriv|  3×10⁻⁹                        gp       0.156     τa      0.143
                                           Brier    0.231

            Coef     S.E.    Wald Z  Pr(>|Z|)
Intercept   -0.0842  0.1823  -0.46   0.6441
x1           0.5902  0.1580   3.74   0.0002
x1²          0.1557  0.1136   1.37   0.1708

latex(anova(f), file='', table.env=FALSE)

             χ²      d.f.  P
x1           13.99   2     0.0009
 Nonlinear    1.88   1     0.1708
TOTAL        13.99   2     0.0009

The bootcov function is used to draw 1000 resamples to obtain bootstrap estimates of the covariance matrix of the regression coefficients as well as to save the 1000 × 3 matrix of regression coefficients. Then, because individual regression coefficients for x do not tell us much, we summarize the x-effect by computing the effect (on the logit scale) of increasing x from 1 to 5. We first compute bootstrap nonparametric percentile confidence intervals the long way. The 1000 bootstrap estimates of the log odds ratio are computed easily using a single matrix multiplication with the difference in predictions approach, multiplying the difference in two design matrices, and we obtain the bootstrap estimate of the standard error of the log odds ratio by computing the sample standard deviation of the 1000 values (footnote e). Bootstrap percentile confidence limits are just sample quantiles from the bootstrapped log odds ratios.


# Get 2-row design matrix for obtaining predicted values
# for x = 1 and 5
X <- cbind(Intercept=1,
           predict(f, data.frame(x1=c(1,5)), type='x'))
Xdif <- X[2,,drop=FALSE] - X[1,,drop=FALSE]
Xdif

  Intercept pol(x1, 2)x1 pol(x1, 2)x1^2
2         0            4             24

b <- bootcov(f, B=1000)
boot.log.odds.ratio <- b$boot.Coef %*% t(Xdif)
sd(boot.log.odds.ratio)

[1] 2.752103

# This is the same as from summary(b, x1=c(1,5)) as summary
# uses the bootstrap covariance matrix
summary(b, x1=c(1,5))[1, 'S.E.']

[1] 2.752103

# Compare this s.d. with one from information matrix
summary(f, x1=c(1,5))[1, 'S.E.']

[1] 2.988373

# Compute percentiles of bootstrap odds ratio
exp(quantile(boot.log.odds.ratio, c(.025, .975)))

        2.5%        97.5%
2.795032e+00 2.067146e+05

# Automatic:
summary(b, x1=c(1,5))[' Odds Ratio',]

         Low         High        Diff.       Effect         S.E.
1.000000e+00 5.000000e+00 4.000000e+00 4.443932e+02           NA
  Lower 0.95   Upper 0.95         Type
2.795032e+00 2.067146e+05 2.000000e+00

print(contrast(b, list(x1=5), list(x1=1), fun=exp))

   Contrast     S.E.    Lower    Upper    Z Pr(>|z|)
11  6.09671 2.752103 1.027843 12.23909 2.22   0.0267

Confidence intervals are 0.95 bootstrap nonparametric percentile intervals

e As indicated below, this standard deviation can also be obtained by using the summary function on the object returned by bootcov, as bootcov returns a fit object like one from lrm except with the bootstrap covariance matrix substituted for the information-based one.


Fig. 9.4 Distribution of 1000 bootstrap x=1:5 log odds ratios (horizontal axis: log(OR))


Figure 9.4 shows the distribution of log odds ratios.

Now consider confidence bands for the true log odds that y = 1, across a sequence of x values. The Predict function automatically calculates point-by-point bootstrap percentiles, basic bootstrap, or BCa [203] confidence limits when the fit has passed through bootcov. Simultaneous Wald-based confidence intervals [307] and Wald intervals substituting the bootstrap covariance matrix estimator are added to the plot when Predict calls the multcomp package (Figure 9.5).


x1s <- seq(0, 5, length=100)
pwald    <- Predict(f, x1=x1s)
psand    <- Predict(robcov(f), x1=x1s)
pbootcov <- Predict(b, x1=x1s, usebootcoef=FALSE)
pbootnp  <- Predict(b, x1=x1s)
pbootbca <- Predict(b, x1=x1s, boot.type='bca')
pbootbas <- Predict(b, x1=x1s, boot.type='basic')
psimult  <- Predict(b, x1=x1s, conf.type='simultaneous')

See Problems at chapter’s end for a worrisome investigation of bootstrap confidence interval coverage using simulation. It appears that when the model’s log odds distribution is not symmetric and includes very high or very low probabilities, neither the bootstrap percentile nor the bootstrap BCa intervals have good coverage, while the basic bootstrap and ordinary Wald intervals are fairly accurate (footnote f). It is difficult in general to know when to trust the bootstrap for logistic and perhaps other models when computing confidence intervals, and the simulation problem suggests that the basic bootstrap should be used more frequently. Similarly, the distribution of bootstrap effect estimates can be suspect. Asymmetry in this distribution does not imply that the true sampling distribution is asymmetric or that the percentile intervals are preferred.


9.8  Further  Use  of  the  Log  Likelihood 

9.8.1 Rating Two Models, Penalizing for Complexity

Suppose that from a single sample two competing models were developed. Let the respective -2 log likelihoods for these models be denoted by $L_1$ and $L_2$, and let $p_1$ and $p_2$ denote the number of parameters estimated in each model. Suppose that $L_1 < L_2$. It may be tempting to rate model one as the “best” fitting or “best” predicting model. That model may provide a better fit for the data at hand, but if it required many more parameters to be estimated, it may not be better “for the money.” If both models were applied to a new sample, model one’s overfitting of the original dataset may actually result in a worse fit on the new dataset.


f Limited simulations using the conditional bootstrap and Firth’s penalized likelihood [281] did not show significant improvement in confidence interval coverage.


[Figure 9.5 plot; legend entries recovered: Boot percentile, Robust sandwich, Boot BCa, Boot covariance+Wald, Wald, Simultaneous]

Fig. 9.5 Predicted log odds and confidence bands for seven types of confidence intervals. Seven categories are ordered top to bottom corresponding to order of lower confidence bands at x1=5. Dotted lines are for Wald-type methods that yield symmetric confidence intervals and assume normality of point estimators.

Akaike’s information criterion (AIC [33, 359, 633]) provides a method for penalizing the log likelihood achieved by a given model for its complexity to obtain a more unbiased assessment of the model’s worth. The penalty is to subtract the number of parameters estimated from the log likelihood, or equivalently to add twice the number of parameters to the -2 log likelihood. The penalized log likelihood is analogous to Mallows’ $C_p$ in ordinary multiple regression. AIC would choose the model by comparing $L_1 + 2p_1$ to $L_2 + 2p_2$ and picking the model with the lower value. We often use AIC in “adjusted $\chi^2$” form:

$$ \mathrm{AIC} = \mathrm{LR}\ \chi^2 - 2p. \tag{9.53} $$

Breiman [66, Section 1.3] and Chatfield [100, Section 4] discuss the fallacy of AIC and $C_p$ for selecting from a series of non-prespecified models.
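
The comparison of $L_1 + 2p_1$ with $L_2 + 2p_2$ can be sketched for two prespecified logistic models. The data are simulated, the helper names pen and lr are hypothetical, and in the adjusted $\chi^2$ form $p$ is taken here to be the number of non-intercept parameters, which is an assumption of this example.

```r
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1))

m1 <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)   # more parameters
m2 <- glm(y ~ x1,           family = binomial, data = d)
m0 <- glm(y ~ 1,            family = binomial, data = d)   # intercept only

# -2 log likelihood, number of parameters, and penalized value L + 2p
pen <- function(m) {
  L <- -2 * as.numeric(logLik(m))
  p <- length(coef(m))                 # counts the intercept
  c(L = L, p = p, L_plus_2p = L + 2 * p)
}
rbind(m1 = pen(m1), m2 = pen(m2))      # the lower L_plus_2p is preferred

# "Adjusted chi-square" form of Equation 9.53; the higher value is preferred
lr <- function(m) 2 * (as.numeric(logLik(m)) - as.numeric(logLik(m0)))
c(m1 = lr(m1) - 2 * 3, m2 = lr(m2) - 2 * 1)
```

The model with more parameters always has the smaller -2 log likelihood here because the models are nested; the penalty decides whether that gain is worth the added complexity.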


9.8.2  Testing  Whether  One  Model  Is  Better 
than  Another 

One way to test whether one model (A) is better than another (B) is to embed both models in a more general model (A + B). Then a LR $\chi^2$ test can be done to test whether A is better than B by changing the hypothesis to test whether A adds predictive information to B ($H_0: A + B > B$) and whether B adds information to A ($H_0: A + B > A$). The approach of testing A > B via testing A + B > B and A + B > A is especially useful for selecting from competing predictors such as a multivariable model and a subjective assessor [131, 264, 395, 669].

Note that LR $\chi^2$ for $H_0: A + B > B$ minus LR $\chi^2$ for $H_0: A + B > A$ equals LR $\chi^2$ for $H_0$: A has no predictive information minus LR $\chi^2$ for $H_0$: B has no predictive information [665], the difference in LR $\chi^2$ for testing each model (set of variables) separately. This gives further support to the use of two separately computed Akaike’s information criteria for rating the two sets of variables.

See Section 9.8.4 for an example.
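
The embedding approach and the identity noted above can be verified numerically. In this sketch the two competing “models” are single simulated predictors named a and b, an assumption made to keep the example small.

```r
set.seed(8)
n <- 300
d <- data.frame(a = rnorm(n))
d$b <- 0.6 * d$a + rnorm(n)                  # competing, correlated predictor
d$y <- rbinom(n, 1, plogis(d$a + 0.3 * d$b))

# Log likelihood of a logistic model with the given formula
ll <- function(f) as.numeric(logLik(glm(f, family = binomial, data = d)))
l0  <- ll(y ~ 1)
lA  <- ll(y ~ a)
lB  <- ll(y ~ b)
lAB <- ll(y ~ a + b)

lr_A_adds_to_B <- 2 * (lAB - lB)             # does A add information to B?
lr_B_adds_to_A <- 2 * (lAB - lA)             # does B add information to A?
lr_A <- 2 * (lA - l0)                        # LR chi-square for A alone
lr_B <- 2 * (lB - l0)                        # LR chi-square for B alone

pchisq(c(A_adds_to_B = lr_A_adds_to_B, B_adds_to_A = lr_B_adds_to_A),
       df = 1, lower.tail = FALSE)
c(lr_A_adds_to_B - lr_B_adds_to_A, lr_A - lr_B)   # the two differences agree
```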


9.8.3  Unitless  Index  of  Predictive  Ability 

The global likelihood ratio test for regression is useful for determining whether any predictor is associated with the response. If the sample is large enough, even weak associations can be “statistically significant.” Even though a likelihood ratio test does not shed light on a model’s predictive strength, the log likelihood (L.L.) can still be useful here. Consider the following L.L.s:

Best (lowest) possible -2 L.L.:
$L^{*}$ = -2 L.L. for a hypothetical model that perfectly predicts the outcome.

-2 L.L. achieved:
$L$ = -2 L.L. for the fitted model.

Worst -2 L.L.:
$L^{0}$ = -2 L.L. for a model that has no predictive information.


The last -2 L.L., for a “no information” model, is the -2 L.L. under the null hypothesis that all regression coefficients except for intercepts are zero. A “no information” model often contains only an intercept and some distributional parameters (a variance, for example).

The quantity $L^0 - L$ is LR, the log likelihood ratio statistic for testing the global null hypothesis that no predictors are related to the response. It is also the -2 log likelihood “explained” by the model. The best (lowest) -2 L.L. is $L^{*}$, so the amount of L.L. that is capable of being explained by the model is $L^0 - L^{*}$. The fraction of -2 L.L. explained that was capable of being explained is

$$ \frac{L^0 - L}{L^0 - L^{*}} = \frac{LR}{L^0 - L^{*}}. \tag{9.54} $$

The fraction of log likelihood explained is analogous to $R^2$ in an ordinary linear model, although Korn and Simon [365, 366] provide a much more precise notion.

Akaike’s information criterion can be used to penalize this measure of association for the number of parameters estimated (p, say) to transform this unitless measure of association into a quantity that is analogous to the adjusted $R^{2}$ or Mallows’ $C_{p}$ in ordinary linear regression. We let R denote the square root of such a penalized fraction of log likelihood explained. R is defined by

$$R^{2} = \frac{LR - 2p}{L^{0} - L^{*}}. \qquad (9.55)$$

The R index can be used to assess how well the model compares with a “perfect” model, as well as to judge whether a more complex model has predictive strength that justifies its additional parameters. Had p been used in Equation 9.55 rather than 2p, $R^{2}$ would be negative if the log likelihood explained is less than what one would expect by chance. R will be the square root of $1 - 2p/(L^{0} - L^{*})$ if the model perfectly predicts the response. This upper limit will be near one if the sample size is large.
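The quantities in Equations 9.54 and 9.55 can be computed directly from two fitted models. The following is a minimal base-R sketch on simulated data, not the author's code. It assumes ungrouped binary responses, for which a perfectly predicting model has likelihood 1 so that $L^{*}$ = 0; the variable names (frac, R2.pen, R.max) are illustrative only.

```r
# Fraction of -2 log likelihood explained (Eq. 9.54) and the
# AIC-penalized R index (Eq. 9.55) for a binary logistic model.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                          # pure noise predictor
y  <- rbinom(n, 1, plogis(-0.5 + x1))

fit  <- glm(y ~ x1 + x2, family = binomial)
null <- glm(y ~ 1, family = binomial)   # "no information" model

L0    <- -2 * as.numeric(logLik(null))  # worst -2 L.L.
L     <- -2 * as.numeric(logLik(fit))   # -2 L.L. achieved
Lstar <- 0                              # assumption: ungrouped 0/1 data
p     <- length(coef(fit)) - 1          # non-intercept parameters

LR     <- L0 - L                        # global LR statistic
frac   <- LR / (L0 - Lstar)             # Eq. 9.54
R2.pen <- (LR - 2 * p) / (L0 - Lstar)   # Eq. 9.55
R      <- sqrt(max(R2.pen, 0))          # reported as 0 when R2.pen is negative
R.max  <- sqrt(1 - 2 * p / (L0 - Lstar))  # value for a perfectly predicting model

round(c(LR = LR, fraction = frac, R2.penalized = R2.pen, R = R, R.max = R.max), 3)
```

Because the penalty 2p is small relative to $L^{0}$ at this sample size, R.max is close to one, in line with the remark about large samples.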

Partial R indexes can also be defined by substituting the -2 L.L. explained for a given factor in place of that for the entire model, LR. The “penalty factor” p becomes one. This index $R_{\mathrm{partial}}$ is defined by

$$R^{2}_{\mathrm{partial}} = \frac{LR_{\mathrm{partial}} - 2}{L^{0} - L^{*}}, \qquad (9.56)$$

which is the (penalized) fraction of -2 log likelihood explained by the predictor. Here $LR_{\mathrm{partial}}$ is the log likelihood ratio statistic for testing whether the predictor is associated with the response, after adjustment for the other predictors. Since such likelihood ratio statistics are tedious to compute, the 1 d.f. Wald $\chi^{2}$ can be substituted for the LR statistic (keeping in mind that difficulties with the Wald statistic can arise).

Liu and Dyer424 and Cox and Wermuth 36 point out difficulties with the $R^{2}$ measure for binary logistic models. Cox and Snell and Magee43 used other analogies to derive other $R^{2}$ measures that may have better properties. For a sample of size n and a Wald statistic W for testing overall association, they defined

$$R^{2}_{W} = \frac{W}{n + W}$$

$$R^{2}_{LR} = 1 - \exp(-LR/n) = 1 - \lambda^{2/n}, \qquad (9.57)$$

where $\lambda$ is the null model likelihood divided by the fitted model likelihood. In the case of ordinary least squares with normality both of the above indexes are equal to the traditional $R^{2}$. $R^{2}_{LR}$ is equivalent to Maddala’s index [431, Eq. 2.44]. Cragg and Uhler13^ and Nagelkerke471 suggested dividing $R^{2}_{LR}$ by its maximum attainable value

$$1 - \exp(-L^{0}/n) \qquad (9.58)$$

to derive $R^{2}_{N}$, which ranges from 0 to 1. This is the form of the $R^{2}$ index we use throughout.
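As an illustration of Equations 9.57 and 9.58, the sketch below computes the Wald-based and LR-based indexes and the rescaled index for a simulated binary logistic fit, using only base R. The object names are illustrative assumptions. The last lines check numerically that the LR-based index reduces to the ordinary $R^{2}$ for a least squares fit with normal errors.

```r
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x))
fit <- glm(y ~ x, family = binomial)

L0 <- fit$null.deviance           # -2 L.L. of the intercept-only model (0/1 data)
LR <- L0 - fit$deviance           # global LR chi-square

# Global Wald statistic for the non-intercept coefficients
b <- coef(fit)[-1]
V <- vcov(fit)[-1, -1, drop = FALSE]
W <- as.numeric(t(b) %*% solve(V) %*% b)

R2.W  <- W / (n + W)                   # Wald-based index
R2.LR <- 1 - exp(-LR / n)              # LR-based index (Eq. 9.57)
R2.N  <- R2.LR / (1 - exp(-L0 / n))    # divided by its maximum (Eq. 9.58)
round(c(R2.W = R2.W, R2.LR = R2.LR, R2.N = R2.N), 3)

# Ordinary least squares check: the LR-based index equals the usual R^2
z      <- x + rnorm(n)
ols    <- lm(z ~ x)
LR.ols <- 2 * (as.numeric(logLik(ols)) - as.numeric(logLik(lm(z ~ 1))))
c(from.LR = 1 - exp(-LR.ols / n), usual = summary(ols)$r.squared)
```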

For penalizing for overfitting, see Verweij and van Houwelingen for an overfitting-corrected $R^{2}$ that uses a cross-validated likelihood.


9.8.4 Unitless Index of Adequacy of a Subset of Predictors


Log likelihoods are also useful for quantifying the predictive information contained in a subset of the predictors compared with the information contained in the entire set of predictors.264 Let LR again denote the -2 log likelihood ratio statistic for testing the joint significance of the full set of predictors. Let $LR^{s}$ denote the -2 log likelihood ratio statistic for testing the importance of the subset of predictors of interest, excluding the other predictors from the model. A measure of adequacy of the subset for predicting the response is given by

$$A = LR^{s}/LR. \qquad (9.59)$$

A is then the proportion of log likelihood explained by the subset with reference to the log likelihood explained by the entire set. When A = 1, the subset contains all the predictive information found in the whole set of predictors; that is, the subset is adequate by itself and the additional predictors contain no independent information. When A = 0, the subset contains no predictive information by itself.

Califf et al. used the A index to quantify the adequacy (with respect to prognosis) of two competing sets of predictors that each describe the extent of coronary artery disease. The response variable was time until cardiovascular death and the statistical model used was the Cox132 proportional hazards model. Some of their results are reproduced in Table 9.3. A chance-corrected adequacy measure could be derived by squaring the ratio of the R index for the subset to the R index for the whole set. A formal test of superiority of $X_{1}$ = maximum % stenosis over $X_{2}$ = jeopardy score can be obtained by testing whether $X_{1}$ adds to $X_{2}$ (LR $\chi^{2}$ = 57.5 - 42.6 = 14.9) and whether $X_{2}$ adds to $X_{1}$ (LR $\chi^{2}$ = 57.5 - 51.8 = 5.7). $X_{1}$ adds more to $X_{2}$ (14.9) than $X_{2}$ adds to $X_{1}$ (5.7). The difference 14.9 - 5.7 = 9.2 equals the difference in single factor $\chi^{2}$ (51.8 - 42.6).665
Table 9.3  Competing prognostic markers

Predictors Used                      LR $\chi^{2}$   Adequacy
Coronary jeopardy score              42.6            0.74
Maximum % stenosis in each artery    51.8            0.90
Combined                             57.5            1.00
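The adequacy column of Table 9.3 and the two “added information” statistics quoted in the text follow from the three LR $\chi^{2}$ values alone. A short R check, using only the numbers reported in the table (the object names are illustrative):

```r
# LR chi-square values reported in Table 9.3
LR.jeopardy <- 42.6
LR.stenosis <- 51.8
LR.combined <- 57.5

# Adequacy index A = LR^s / LR (Eq. 9.59)
adequacy <- c(jeopardy = LR.jeopardy,
              stenosis = LR.stenosis,
              combined = LR.combined) / LR.combined
round(adequacy, 2)    # 0.74, 0.90, 1.00

# Information each set adds to the other
add.stenosis <- LR.combined - LR.jeopardy   # stenosis added to jeopardy: 14.9
add.jeopardy <- LR.combined - LR.stenosis   # jeopardy added to stenosis:  5.7
add.stenosis - add.jeopardy                 # 9.2 = 51.8 - 42.6
```

P-values for the two added-information statistics are not computed here because the degrees of freedom of each predictor set are not given in the text.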

9.9  Weighted  Maximum  Likelihood  Estimation 

It is commonly the case that data elements represent combinations of values that pertain to a set of individuals. This occurs, for example, when unique combinations of X and Y are determined from a massive dataset, along with the frequency of occurrence of each combination, for the purpose of reducing the size of the dataset to analyze. For the ith combination we have a case weight $w_{i}$ that is a positive integer representing a frequency. Assuming that observations represented by combination i are independent, the likelihood needed to represent all $w_{i}$ observations is computed simply by multiplying all of the likelihood elements (each having value $L_{i}$), yielding a total likelihood contribution for combination i of $L_{i}^{w_{i}}$ or a log likelihood contribution of $w_{i} \log L_{i}$. To obtain a likelihood for the entire dataset one computes the product over all combinations. The total log likelihood is $\sum w_{i} \log L_{i}$. As an example, the weighted likelihood that would be used to fit a weighted logistic regression model is given by

$$L = \prod_{i=1}^{n} P_{i}^{w_{i} Y_{i}} (1 - P_{i})^{w_{i}(1 - Y_{i})}, \qquad (9.60)$$

where there are n combinations, $\sum_{i=1}^{n} w_{i} > n$, and $P_{i}$ is Prob[$Y_{i}$ = 1 | $X_{i}$] as dictated by the model. Note that in general the correct likelihood function cannot be obtained by weighting the data and using an unweighted likelihood.
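The frequency-weight case can be checked numerically. The sketch below, in base R on simulated data, collapses a dataset to its unique (x, y) combinations and compares a fit to the full data with a frequency-weighted fit to the collapsed data. The helper wloglik is a hypothetical function written for this illustration; it evaluates the log of Equation 9.60.

```r
set.seed(3)
N <- 2000
full   <- data.frame(x = sample(1:4, N, replace = TRUE))
full$y <- rbinom(N, 1, plogis(-1 + 0.4 * full$x))

# Collapse to unique (x, y) combinations with frequency weight w
agg <- aggregate(w ~ x + y, data = transform(full, w = 1), FUN = sum)

# Weighted log likelihood: sum of w_i * log(L_i), the log of Eq. 9.60
wloglik <- function(beta, d) {
  P <- plogis(beta[1] + beta[2] * d$x)
  sum(d$w * (d$y * log(P) + (1 - d$y) * log(1 - P)))
}

f.full <- glm(y ~ x, family = binomial, data = full)               # one row per subject
f.wt   <- glm(y ~ x, family = binomial, data = agg, weights = w)   # one row per combination

rbind(full = coef(f.full), weighted = coef(f.wt))
c(weighted = wloglik(coef(f.wt), agg), full = as.numeric(logLik(f.full)))
```

With integer frequency weights the two fits agree. This equivalence is specific to frequencies; it does not extend to the survey-type weights discussed next.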

By a small leap one can obtain weighted maximum likelihood estimates from the above method even if the weights do not represent frequencies or even integers, as long as the weights are non-negative. Non-frequency weights are commonly used in sample surveys to adjust estimates back to better represent a target population when some types of subjects have been oversampled from that population. Analysts should beware of possible losses in efficiency when obtaining weighted estimates in sample surveys.363,364 Making the regression estimates conditional on sampling strata by including strata as covariables may be preferable to re-weighting the strata. If weighted estimates must be obtained, the weighted likelihood function is generally valid for obtaining properly weighted parameter estimates. However, the variance-covariance matrix obtained by inverting the information matrix from the weighted likelihood will not be correct in general. For one thing, the sum of the weights may be far from the number of subjects in the sample. A rough approximation to the variance-covariance matrix may be obtained by first multiplying each weight by $n/\sum w_{i}$ and then computing the weighted information matrix, where n is the number of actual subjects in the sample.


9.10  Penalized  Maximum  Likelihood  Estimation 


Maximizing the log likelihood provides the best fit to the dataset at hand, but this can also result in fitting noise in the data. For example, a categorical predictor with 20 levels can produce extreme estimates for some of the 19 regression parameters, especially for the small cells (see Section 4.5). A shrinkage approach will often result in regression coefficient estimates that while biased are lower in mean squared error and hence are more likely to be close to the true unknown parameter values. Ridge regression is one approach to shrinkage, but a more general and better developed approach is penalized maximum likelihood estimation,237,388,639,641 which is really a special case of Bayesian modeling with a Gaussian prior. Letting L denote the usual likelihood function and $\lambda$ be a penalty factor, we maximize the penalized log likelihood given by

$$\log L - \frac{1}{2} \lambda \sum_{i=1}^{p} (s_{i} \beta_{i})^{2}, \qquad (9.61)$$

where $s_{1}, \ldots, s_{p}$ are scale factors chosen to make $s_{i}\beta_{i}$ unitless. Most authors standardize the data first and do not have scale factors in the equation, but Equation 9.61 has the advantage of allowing estimation of $\beta$ on the original scale of the data. The usual methods (e.g., Newton-Raphson) are used to maximize 9.61.

The choice of the scaling constants has received far too little attention in the ridge regression and penalized MLE literature. It is common to use the standard deviation of each column of the design matrix to scale the corresponding parameter. For models containing nothing but continuous variables that enter the regression linearly, this is usually a reasonable approach. For continuous variables represented with multiple terms (one of which is linear), it is not always reasonable to scale each nonlinear term with its own standard deviation. For dummy variables, scaling using the standard deviation ($\sqrt{d(1 - d)}$, where d is the mean of the dummy variable, i.e., the fraction of observations in that cell) is problematic since this will result in high prevalence cells getting more shrinkage than low prevalence ones because the high prevalence cells will dominate the penalty function.

An  advantage  of  the  formulation  in  Equation  9.61  is  that  one  can  assign 
scale  constants  of  zero  for  parameters  for  which  no  shrinkage  is  desired.237, 639 
For  example,  one  may  have  prior  beliefs  that  a  linear  additive  model  will  fit 
the  data.  In  that  case,  nonlinear  and  non-additive  terms  may  be  penalized. 
For a categorical predictor having c levels, users of ridge regression often do not recognize that the amount of shrinkage and the predicted values from the fitted model depend on how the design matrix is coded. For example, one will get different predictions depending on which cell is chosen as the reference cell when constructing dummy variables. The setup in Equation 9.61 has the same problem. For example, if for a three-category factor we use category 1 as the reference cell and have parameters $\beta_{2}$ and $\beta_{3}$, the unscaled penalty function is $\beta_{2}^{2} + \beta_{3}^{2}$. If category 3 were used as the reference cell instead, the penalty would be $\beta_{3}^{2} + (\beta_{2} - \beta_{3})^{2}$. To get around this problem, Verweij and van Houwelingen639 proposed using the penalty function $\sum_{i=1}^{c} (\beta_{i} - \bar{\beta})^{2}$, where $\bar{\beta}$ is the mean of all c $\beta$s. This causes shrinkage of all parameters toward the mean parameter value. Letting the first category be the reference cell, we use c - 1 dummy variables and define $\beta_{1} \equiv 0$. For the case c = 3 the sum of squares is $2[\beta_{2}^{2} + \beta_{3}^{2} - \beta_{2}\beta_{3}]/3$. For c = 2 the penalty is $\beta_{2}^{2}/2$. If no scale constant is used, this is the same as scaling $\beta_{2}$ with $\sqrt{2}$ times the standard deviation of a binary dummy variable with prevalence of 0.5.

The sum of squares can be written in matrix form as $[\beta_{2}, \ldots, \beta_{c}]' (A - B) [\beta_{2}, \ldots, \beta_{c}]$, where A is a $(c-1) \times (c-1)$ identity matrix and B is a $(c-1) \times (c-1)$ matrix all of whose elements are $1/c$.
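The matrix form can be verified numerically. The following base-R sketch uses arbitrary illustrative coefficient values (not taken from any fitted model) and checks that the quadratic form equals the sum of squared deviations from the mean, along with the closed forms quoted for c = 3 and c = 2.

```r
n.levels <- 4
beta <- c(0, 0.7, -1.2, 0.3)             # beta_1 = 0 for the reference cell
direct <- sum((beta - mean(beta))^2)     # sum over all c levels

A <- diag(n.levels - 1)                                    # identity matrix
B <- matrix(1 / n.levels, n.levels - 1, n.levels - 1)      # all elements 1/c
b <- beta[-1]                                              # beta_2, ..., beta_c
quad <- as.numeric(t(b) %*% (A - B) %*% b)
c(direct = direct, matrix.form = quad)

# c = 3: 2 * [b2^2 + b3^2 - b2 * b3] / 3
b3 <- c(0.7, -1.2)
c3.direct  <- sum((c(0, b3) - mean(c(0, b3)))^2)
c3.formula <- 2 * (b3[1]^2 + b3[2]^2 - b3[1] * b3[2]) / 3

# c = 2: b2^2 / 2, i.e. (s * b2)^2 with s = sqrt(2) * sqrt(0.5 * 0.5)
b2 <- 0.7
c2.direct <- sum((c(0, b2) - mean(c(0, b2)))^2)
c2.scaled <- (sqrt(2) * sqrt(0.5 * 0.5) * b2)^2
```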

For general penalty functions such as that just described, the penalized log likelihood can be generalized to

$$\log L - \frac{1}{2} \lambda \beta' P \beta. \qquad (9.62)$$


For purposes of using the Newton-Raphson procedure, the first derivative of the penalty function with respect to $\beta$ is $-\lambda P \beta$ and the negative of the second derivative is $\lambda P$.
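To make the role of these two derivatives concrete, here is a minimal Newton-Raphson sketch for a penalized binary logistic model in base R. It is an illustration under stated assumptions, not the algorithm used by any particular package: pen.logistic is a hypothetical helper, the intercept is left unpenalized by giving it a zero in P, and the scale factors are taken to be the column standard deviations.

```r
# Maximize log L - (1/2) * lambda * beta' P beta for a logistic model
pen.logistic <- function(X, y, lambda, P, maxit = 50, tol = 1e-10) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    prob  <- as.vector(plogis(X %*% beta))
    # gradient: score of log L plus derivative of the penalty (-lambda * P beta)
    score <- crossprod(X, y - prob) - lambda * (P %*% beta)
    # negative Hessian: information from log L plus lambda * P
    info  <- crossprod(X, X * (prob * (1 - prob))) + lambda * P
    step  <- as.vector(solve(info, score))
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  beta
}

set.seed(4)
n <- 200
x <- cbind(rnorm(n), rnorm(n, sd = 3))
y <- rbinom(n, 1, plogis(0.5 * x[, 1]))
X <- cbind(1, x)                 # design matrix with intercept column
s <- apply(x, 2, sd)             # scale factors s_i (an assumption of this sketch)
P <- diag(c(0, s^2))             # zero entry: no shrinkage of the intercept

b0 <- pen.logistic(X, y, lambda = 0, P)   # ordinary MLE
b5 <- pen.logistic(X, y, lambda = 5, P)   # shrunken estimates
rbind(unpenalized = b0, penalized = b5)
```

With lambda = 0 the iterations reduce to ordinary Newton-Raphson, so b0 should agree with the coefficients from glm(y ~ x, family = binomial).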

Another problem in penalized estimation is how the choice of $\lambda$ is made. Many authors use cross-validation. A limited number of simulation studies in binary logistic regression modeling has shown that for each $\lambda$ being considered, at least 10-fold cross-validation must be done so as to obtain a reasonable estimate of predictive accuracy. Even then, a smoother207 (“super smoother”) must be used on the ($\lambda$, accuracy) pairs to allow location of the optimum value unless one is careful in choosing the initial sub-samples and uses these same splits throughout. Simulation studies have shown that a modified AIC is not only much quicker to compute (since it requires no cross-validation) but performs better at finding a good value of $\lambda$ (see below).

For a given $\lambda$, the effective number of parameters being estimated is reduced because of shrinkage. Gray [237, Eq. 2.9] and others estimate the effective degrees of freedom by computing the expected value of a global Wald statistic for testing association, when the null hypothesis of no association is true. The d.f. is equal to

$$\mathrm{trace}[I(\hat{\beta}^{P}) V(\hat{\beta}^{P})], \qquad (9.63)$$

where $\hat{\beta}^{P}$ is the penalized MLE (the parameters that maximize Equation 9.61), I is the information matrix computed from ignoring the penalty function, and V is the covariance matrix computed by inverting the information matrix that included the second derivatives with respect to $\beta$ in the penalty function.

Gray [237, Eq. 2.6] states that a better estimate of the variance-covariance matrix for $\hat{\beta}^{P}$ than $V(\hat{\beta}^{P})$ is

$$V^{*} = V(\hat{\beta}^{P}) I(\hat{\beta}^{P}) V(\hat{\beta}^{P}). \qquad (9.64)$$

Therneau (personal communication, 2000) has found in a limited number of simulation studies that $V^{*}$ underestimates the true variances, and that a better estimate of the variance-covariance matrix is simply $V(\hat{\beta}^{P})$, assuming that the model is correctly specified. This is the covariance matrix used by default in the rms package (the user can request that the sandwich estimator be used instead) and is in fact the one Gray used for Wald tests.

Penalization will bias estimates of $\beta$, so hypothesis tests and confidence intervals using $\hat{\beta}^{P}$ may not have a simple interpretation. The same problem arises in score and likelihood ratio tests. So far, penalization is better understood in pure prediction mode unless Bayesian methods are used.

Equation 9.63 can be used to derive a modified AIC (see [639, Eq. 6] and [641, Eq. 7]) on the model $\chi^{2}$ scale:

$$LR\ \chi^{2} - 2 \times \text{effective d.f.}, \qquad (9.65)$$

where LR $\chi^{2}$ is the likelihood ratio $\chi^{2}$ for the penalized model, but ignoring the penalty function. If a variety of $\lambda$ are tried and one plots the ($\lambda$, AIC) pairs, the $\lambda$ that maximizes AIC will often be a good choice, that is, it is likely to be near the value of $\lambda$ that maximizes predictive accuracy on a future dataset (see footnote g).

Note that if one does penalized maximum likelihood estimation where a set of variables being penalized has a negative value for the unpenalized $\chi^{2} - 2 \times$ d.f., the value of $\lambda$ that will optimize the overall model AIC will be $\infty$.
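The effective d.f. of Equation 9.63 and the modified AIC of Equation 9.65 can be traced over a grid of penalties. The sketch below does this for penalized least squares in base R, where the penalized estimate has a closed form. It rests on simplifying assumptions that are not part of the text: predictors and response are centered so that no intercept is needed, the error variance is replaced by a plug-in estimate from the unpenalized fit and then treated as known, and one.lambda is a hypothetical helper.

```r
set.seed(7)
n <- 100; k <- 10
X <- scale(matrix(rnorm(n * k), n, k), scale = FALSE)   # centered predictors
y <- as.vector(X[, 1] + 0.5 * X[, 2] + rnorm(n))        # 8 of 10 predictors are noise
y <- y - mean(y)                                        # centered response

sigma2 <- sum(resid(lm(y ~ X - 1))^2) / (n - k - 1)     # plug-in error variance
P      <- diag(apply(X, 2, var))                        # s_i^2 on the diagonal
info   <- crossprod(X) / sigma2                         # information ignoring the penalty

one.lambda <- function(lambda) {
  V    <- solve(info + lambda * P)           # inverse of the penalized information
  beta <- V %*% crossprod(X, y) / sigma2     # penalized estimate
  rss  <- sum((y - X %*% beta)^2)
  LR   <- (sum(y^2) - rss) / sigma2          # LR chi-square ignoring the penalty
  df   <- sum(diag(info %*% V))              # effective d.f. (Eq. 9.63)
  c(lambda = lambda, df = df, LR = LR, AIC = LR - 2 * df)   # modified AIC (Eq. 9.65)
}

lambdas <- c(0, 0.5, 1, 2, 5, 10, 20, 50, 100)
res  <- t(sapply(lambdas, one.lambda))
best <- res[which.max(res[, "AIC"]), ]
round(res, 2)
best
```

At lambda = 0 the effective d.f. equal the number of predictors, and both the d.f. and the LR $\chi^{2}$ fall as the penalty grows; the modified AIC picks the penalty at which the loss in $\chi^{2}$ stops being worth the d.f. saved.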

As an example, consider some simulated data (n = 100) with one predictor in which the true model is $Y = X_{1} + \epsilon$, where $\epsilon$ has a standard normal distribution and so does $X_{1}$. We use a series of penalties (found by trial and error) that give rise to sensible effective d.f., and fit penalized restricted cubic spline functions with five knots. We penalize two ways: all terms in the model including the coefficient of $X_{1}$, which in reality needs no penalty; and only the nonlinear terms. The following R program, in conjunction with the rms package, does the job.


g  Several  examples  from  simulated  datasets  have  shown  that  using  BIC  to  choose  a 
penalty  results  in  far  too  much  shrinkage. 
require(rms)   # provides ols, rcs, and Predict

set.seed(191)
x1 <- rnorm(100)
y  <- x1 + rnorm(100)

pens <- df <- aic <- c(0, .07, .5, 2, 6, 15, 60)
all  <- nl <- list()   # predictions: all terms penalized / nonlinear terms only

for(penalize in 1:2) {
  for(i in 1:length(pens)) {
    # penalize == 1: penalize linear and nonlinear terms; 2: nonlinear terms only
    f <- ols(y ~ rcs(x1, 5), penalty =
             list(simple    = if(penalize == 1) pens[i] else 0,
                  nonlinear = pens[i]))
    df[i]  <- f$stats['d.f.']   # effective d.f.
    aic[i] <- AIC(f)
    nam <- paste(if(penalize == 1) 'all' else 'nl',
                 'penalty:', pens[i], sep='')
    nam <- as.character(pens[i])   # overrides the label built above
    p <- Predict(f, x1=seq(-2.5, 2.5, length=100),
                 conf.int=FALSE)
    if(penalize == 1) all[[nam]] <- p else nl[[nam]] <- p
  }
  print(rbind(df=df, aic=aic))
}
The left panel in Figure 9.6 corresponds to penalty = list(simple=a, nonlinear=a) in the R program, meaning that all parameters except the intercept are shrunk by the same amount a (this would be more appropriate had there been multiple predictors). As effective d.f. get smaller (penalty factor gets larger), the regression fits get flatter (too flat for the largest penalties) and confidence bands get narrower. The right graph corresponds to penalty = list(simple=0, nonlinear=a), causing only the cubic spline terms that are nonlinear in $X_{1}$ to be shrunk. As the amount of shrinkage increases (d.f. lowered), the fits become more linear and closer to the true regression line (longer dotted line). Again, confidence intervals become smaller.
Fig. 9.6  Penalized least squares estimates for an unnecessary five-knot restricted cubic spline function. In the left graph all parameters (except the intercept) are penalized. The effective d.f. are 4, 3.21, 2.71, 2.30, 2.03, 1.82, and 1.51. In the right graph, only parameters associated with nonlinear functions of $X_{1}$ are penalized. The effective d.f. are 4, 3.22, 2.73, 2.34, 2.11, 1.96, and 1.68.
9.11 Further Reading

Boos60 has some nice generalizations of the score test. Morgan et al.464 show how score test $\chi^{2}$ statistics may be negative unless the expected information matrix is used.

See Marubini and Valsecchi [444, pp. 164-169] for an excellent description of the relationship between the three types of test statistics.

References [115,507] have good descriptions of methods used to maximize log L. As Long and Ervin 26 argue, for small sample sizes, the usual Huber-White covariance estimator should not be used because there the residuals do not have constant variance even under homoscedasticity. They showed that a simple correction due to Efron and others can result in substantially better estimates. Lin and Wei,410 Binder,55 and Lin40' have applied the Huber estimator to the Cox132 survival model. Freedman206 questioned the use of sandwich estimators because they are often used to obtain the right variances on the wrong parameters when the model doesn’t fit. He also has some excellent background information.

Feng et al.188 showed that in the case of cluster correlations arising from repeated measurement data with Gaussian errors, the cluster bootstrap performs excellently even when the number of observations per cluster is large and the number of subjects is small. Xiao and Abrahamowicz676 compared the cluster bootstrap with a two-stage cluster bootstrap in the context of the Cox model. Graubard and Korn235 and Fitzmaurice195 describe the kinds of situations in which the working independence model can be trusted.

Minkin,460 Alho,11 Doganaksoy and Schmee,160 and Meeker and Escobar452 discuss the need for LR and score-based confidence intervals. Alho found that score-based intervals are usually more tedious to compute, and provided useful algorithms for the computation of either type of interval (see also [452] and [444, p. 167]). Score and LR intervals require iterative computations and have to deal with the fact that when one parameter is changed (e.g., $b_{1}$ is restricted to be zero), all other parameter estimates change. DiCiccio and Efron157 provide a method for very accurate confidence intervals for exponential families that requires a modest amount of additional computation. Venzon and Moolgavkar provide an efficient general method for computing LR-based intervals.636 Brazzale and Davison65 developed some promising and feasible ways to make unconditional likelihood-based inferences more accurate in small samples. Carpenter and Bithell92 have an excellent overview of several variations on the bootstrap for obtaining confidence limits.

Tibshirani and Knight610 developed an easy to program approach for deriving simultaneous confidence sets that is likely to be useful for getting simultaneous confidence regions for the entire vector of model parameters, for population values for an entire sequence of predictor values, and for a set of regression effects (e.g., interquartile-range odds ratios for age for both sexes). The basic idea is that during the, say, 1000 bootstrap repetitions one stores the -2 log likelihood for each model fit, being careful to compute the likelihood at the current bootstrap parameter estimates but with respect to the original data matrix, not the bootstrap sample of the data matrix. To obtain an approximate simultaneous 0.95 confidence set one computes the 0.95 quantile of the -2 log likelihood values and determines which vectors of parameter estimates correspond to -2 log likelihoods that are at least as small as the 0.95 quantile of all -2 log likelihoods. Once the qualifying parameter estimates are found, the quantities of interest are computed from those parameter estimates and an outer envelope of those quantities is found. Computations are facilitated with the rms package confplot function.
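The steps just described can be sketched in base R for a one-predictor binary logistic model. This is an illustrative reading of the procedure on simulated data, not the rms implementation; m2ll is a hypothetical helper that evaluates the -2 log likelihood of the original data at any coefficient vector, and the envelope is formed for predicted probabilities over a grid of predictor values.

```r
set.seed(6)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * x))
X <- cbind(1, x)                              # original design matrix

# -2 log likelihood of the ORIGINAL data at coefficient vector beta
m2ll <- function(beta) {
  eta <- as.vector(X %*% beta)
  -2 * sum(y * eta - log1p(exp(eta)))
}

B <- 1000
boot.beta <- matrix(NA_real_, B, 2)
for (b in 1:B) {
  i <- sample(n, replace = TRUE)              # bootstrap sample of subjects
  boot.beta[b, ] <- coef(glm(y[i] ~ x[i], family = binomial))
}

dev  <- apply(boot.beta, 1, m2ll)             # evaluated on the original data
keep <- dev <= quantile(dev, 0.95)            # qualifying parameter vectors

# Outer envelope of predicted probabilities over a grid of x values
xs    <- seq(-2, 2, length = 50)
eta   <- cbind(1, xs) %*% t(boot.beta[keep, , drop = FALSE])
lower <- plogis(apply(eta, 1, min))
upper <- plogis(apply(eta, 1, max))
head(cbind(x = xs, lower = lower, upper = upper))
```

The envelope is taken on the linear predictor scale and then transformed, which gives the same limits as taking it on the probability scale because plogis is monotone.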

van Houwelingen and le Cessie [633, Eq. 52] showed, consistent with AIC, that the average optimism in a mean logarithmic (minus log likelihood) quality score for logistic models is p/n.

Schwarz560 derived a different penalty using large-sample Bayesian properties of competing models. His Bayesian Information Criterion (BIC) chooses the model having the lowest value of L + 1/2 p log n or the highest value of LR $\chi^{2}$ - p log n. Kass and Raftery have done several studies of BIC.337 Smith and Spiegelhalter576 and Laud and Ibrahim3 discussed other useful generalizations of likelihood penalties. Zheng and Loh685 studied several penalty measures, and found that AIC does not penalize enough for overfitting in the ordinary regression case. Kass and Raftery [337, p. 790] provide a nice review of this topic, stating that “AIC picks the correct model asymptotically if the complexity of the true model grows with sample size” and that “AIC selects models that are too big even when the sample size is large.” But they also cite other papers that show the existence of cases where AIC can work better than BIC. According to Buckland et al.,80 BIC “assumes that a true model exists and is low-dimensional.”

Hurvich and Tsai314,315 made an improvement in AIC that resulted in much better model selection for small n. They defined the corrected AIC as

$$\mathrm{AIC}_{C} = LR\ \chi^{2} - 2p\left[1 + \frac{p + 1}{n - p - 1}\right]. \qquad (9.66)$$

In [314] they contrast asymptotically efficient model selection with AIC when the true model has infinitely many parameters with improvements using other indexes such as $\mathrm{AIC}_{C}$ when the model is finite.
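For a concrete sense of how much the small-sample correction matters, the $\chi^{2}$-scale criteria can be written as one-line R functions. The function names and the example values (LR $\chi^{2}$ = 30 with p = 5) are illustrative assumptions; larger values are better on this scale.

```r
# Criteria on the chi-square scale (larger is better)
aic.chisq  <- function(LR, p)    LR - 2 * p
aicc.chisq <- function(LR, p, n) LR - 2 * p * (1 + (p + 1) / (n - p - 1))   # Eq. 9.66
bic.chisq  <- function(LR, p, n) LR - p * log(n)

LR <- 30; p <- 5
small <- c(AIC  = aic.chisq(LR, p),
           AICc = aicc.chisq(LR, p, n = 40),
           BIC  = bic.chisq(LR, p, n = 40))
large <- c(AIC  = aic.chisq(LR, p),
           AICc = aicc.chisq(LR, p, n = 4000),
           BIC  = bic.chisq(LR, p, n = 4000))
round(rbind(n40 = small, n4000 = large), 2)
```

With n = 40 the correction removes an extra 2p(p + 1)/(n - p - 1) = 60/34 from the AIC; with n = 4000 the corrected and uncorrected values are nearly identical, while the BIC penalty keeps growing with log n.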

One  difficulty  in  applying  the  Schwarz,  AICc,  and  related  criteria  is  that  with 
censored  or  binary  responses  it  is  not  clear  that  the  actual  sample  size  n  should 
be  used  in  the  formula. 

12 Goldstein,222 Willan et al.,669 and Royston and Thompson534 have nice discussions on comparing non-nested regression models. Schemper’s method549 is useful for testing whether a set of variables provides significantly greater information (using an $R^{2}$ measure) than another set of variables.

13 van Houwelingen and le Cessie [633, Eq. 22] recommended using L/2 (also called the Kullback-Leibler error rate) as a quality index.

14 Schemper549 provides a bootstrap technique for testing for significant differences between correlated $R^{2}$ measures. Mittlbock and Schemper,461 Schemper and Stare,554 Korn and Simon,365,366 Menard,454 and Zheng and Agresti684 have excellent discussions about the pros and cons of various indexes of the predictive value of a model.

15 Al-Radi et al.10 presented another analysis comparing competing predictors using the adequacy index and a receiver operating characteristic curve area approach based on a test for whether one predictor has a higher probability of being “more concordant” than another.

16 [55,97,409] provide good variance-covariance estimators from a weighted maximum likelihood analysis.

17 Huang and Harrington310 developed penalized partial likelihood estimates for Cox models and provided useful background information and theoretical results about improvements in mean squared errors of regression estimates. They used a bootstrap error estimate for selection of the penalty parameter.

18 Sardy538 proposes that the square roots of the diagonals of the inverse of the covariance matrix for the predictors be used for scaling rather than the standard deviations.

19 Park and Hastie483 and articles referenced therein describe how quadratic penalized logistic regression automatically sets coefficient estimates for empty cells to zero and forces the sum of k coefficients for a k-level categorical predictor to equal zero.

20 Greenland241 has a nice discussion of the relationship between penalized maximum likelihood estimation and mixed effects models. He cautions against estimating the shrinkage parameter.

See [310] for a bootstrap approach to selection of $\lambda$.

Verweij and van Houwelingen [639, Eq. 4] derived another expression for d.f., but it requires more computation and did not perform any better than Equation 9.63 in choosing $\lambda$ in several examples tested.

See van Houwelingen and Thorogood 31 for an approximate empirical Bayes approach to shrinkage. See Tibshirani608 for the use of a non-smooth penalty function that results in variable selection as well as shrinkage (see Section 4.3). Verweij and van Houwelingen640 used a “cross-validated likelihood” based on leave-out-one estimates to penalize for overfitting. Wang and Taylor652 presented some methods for carrying out hypothesis tests and computing confidence limits under penalization. Moons et al.462 presented a case study of penalized estimation and discussed the advantages of penalization.
Table 9.4  Likelihood ratio global test statistics

Variables in Model           LR $\chi^{2}$
age                          100
sex                          108
age, sex                     111
age$^{2}$                    60
age, age$^{2}$               102
age, age$^{2}$, sex          115

9.12  Problems 

1. A sample of size 100 from a normal distribution with unknown mean and standard deviation ($\mu$ and $\sigma$) yielded the following log likelihood values when computed at two values of $\mu$.

log L($\mu$ = 10, $\sigma$ = 5) = -800
log L($\mu$ = 20, $\sigma$ = 5) = -820.

What do you know about $\mu$? What do you know about $\bar{Y}$?

2. Several regression models were considered for predicting a response. LR $\chi^{2}$ (corrected for the intercept) for models containing various combinations of variables are found in Table 9.4. Compute all possible meaningful LR $\chi^{2}$. For each, state the d.f. and an approximate P-value. State which LR $\chi^{2}$ involving only one variable is not very meaningful.

3.  For  each  problem  below,  rank  Wald,  score,  and  LR  statistics  by  overall 
statistical  properties  and  then  by  computational  convenience. 

a.  A  forward  stepwise  variable  selection  (to  be  later  accounted  for  with  the 
bootstrap)  is  desired  to  determine  a  concise  model  that  contains  most 
of  the  independent  information  in  all  potential  predictors. 

b.  A  test  of  independent  association  of  each  variable  in  a  given  model  (each 
variable  adjusted  for  the  effects  of  all  other  variables  in  the  given  model) 
is  to  be  obtained. 

c. A model that contains only additive effects is fitted. A large number
of potential interaction terms are to be tested using a global (multiple
d.f.) test.

4. Consider a univariate saturated model in 3 treatments (A, B, C) that is
quadratic in age. Write out the model with all the $\beta$s, and write in detail
the contrast for comparing treatment B with treatment C for 30 year olds.
Sketch out the same contrast using the “difference in predictions” approach
without simplification.

5. Simulate a binary logistic model for n = 300 with an average fraction of
events somewhere between 0.15 and 0.3. Use 5 continuous covariates and
assume the model is everywhere linear. Fit an unpenalized model, then
solve for the optimum quadratic penalty $\lambda$. Relate the resulting
effective d.f. to the 15:1 rule of thumb, and compute the heuristic shrinkage
coefficient $\hat{\gamma}$ for the unpenalized model and for the optimally
penalized model, inserting the effective d.f. for the number of non-intercept
parameters in the model.

6. For a similar setup as the binary logistic model simulation in Section 9.7,
do a Monte Carlo simulation to determine the coverage probabilities for
ordinary Wald and for three types of bootstrap confidence intervals for the
true $x = 5$ to $x = 1$ log odds ratio. In addition, consider the Wald-type
confidence interval arising from the sandwich covariance estimator. Estimate
the non-coverage probabilities in both tails. Use a sample size n = 200
with the single predictor $x_1$ having a standard log-normal distribution,
and the true model being $\mathrm{logit}(Y = 1) = 1 + x_1/2$. Determine whether
increasing the sample size relieves any problem you observed. Some R code
for this simulation is on the web site.


Chapter  10 

Binary  Logistic  Regression 


10.1  Model 


Binary  responses  are  commonly  studied  in  many  fields.  Examples  include 
the  presence  or  absence  of  a  particular  disease,  death  during  surgery,  or  a 
consumer  purchasing  a  product.  Often  one  wishes  to  study  how  a  set  of 
predictor  variables  X  is  related  to  a  dichotomous  response  variable  Y.  The 
predictors  may  describe  such  quantities  as  treatment  assignment,  dosage,  risk 
factors,  and  calendar  time. 

For convenience we define the response to be Y = 0 or 1, with Y = 1
denoting  the  occurrence  of  the  event  of  interest.  Often  a  dichotomous  outcome 
can  be  studied  by  calculating  certain  proportions,  for  example,  the  proportion 
of  deaths  among  females  and  the  proportion  among  males.  However,  in  many 
situations,  there  are  multiple  descriptors,  or  one  or  more  of  the  descriptors 
are  continuous.  Without  a  statistical  model,  studying  patterns  such  as  the 
relationship  between  age  and  occurrence  of  a  disease,  for  example,  would 
require  the  creation  of  arbitrary  age  groups  to  allow  estimation  of  disease 
prevalence  as  a  function  of  age. 

Letting X denote the vector of predictors $\{X_1, X_2, \ldots, X_k\}$, a first
attempt at modeling the response might use the ordinary linear regression model

$$E\{Y \mid X\} = X\beta, \tag{10.1}$$

since the expectation of a binary variable Y is $\mathrm{Prob}\{Y = 1\}$.
However, such a model by definition cannot fit the data over the whole range of
the predictors since a purely linear model
$E\{Y \mid X\} = \mathrm{Prob}\{Y = 1 \mid X\} = X\beta$ can allow
$\mathrm{Prob}\{Y = 1\}$ to exceed 1 or fall below 0. The statistical model that
is generally preferred for the analysis of binary responses is instead the
binary logistic regression model, stated in terms of the probability that Y = 1
given X, the values of the predictors:

$$\mathrm{Prob}\{Y = 1 \mid X\} = [1 + \exp(-X\beta)]^{-1}. \tag{10.2}$$


As before, $X\beta$ stands for
$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$. The binary
logistic regression model was developed primarily by Cox [129] and Walker and
Duncan [647]. The regression parameters $\beta$ are estimated by the method of
maximum likelihood (see below).

The function

$$P = [1 + \exp(-x)]^{-1} \tag{10.3}$$

is called the logistic function. This function is plotted in Figure 10.1 for x
varying from -4 to +4. This function has an unlimited range for x while P
is restricted to range from 0 to 1.


Fig. 10.1 Logistic function


For future derivations it is useful to express x in terms of P. Solving the
equation above for x by using

$$1 - P = \exp(-x)/[1 + \exp(-x)] \tag{10.4}$$

yields the inverse of the logistic function:

$$x = \log[P/(1 - P)] = \log[\text{odds that } Y = 1 \text{ occurs}] = \mathrm{logit}\{Y = 1\}. \tag{10.5}$$
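
As a quick numerical check of Equations 10.3 to 10.5, the following R sketch evaluates the logistic function and its inverse. The helper names `logistic` and `logit` are illustrative and do not come from the text; `plogis` and `qlogis` are the corresponding base R functions.

```r
# Logistic function (Equation 10.3) and its inverse, the logit (Equation 10.5)
logistic <- function(x) 1 / (1 + exp(-x))
logit    <- function(p) log(p / (1 - p))

x <- seq(-4, 4, by = 0.5)
P <- logistic(x)

# P stays strictly between 0 and 1 although x is unrestricted
range(P)

# The logit recovers x from P (difference is at rounding-error level)
max(abs(logit(P) - x))

# Base R supplies the same pair as plogis() and qlogis()
all.equal(P, plogis(x))
all.equal(logit(P), qlogis(P))
```

Working on the logit scale is what lets the model stay linear in the predictors while the probability remains bounded.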


Other methods that have been used to analyze binary response data include
the probit model, which writes P in terms of the cumulative normal
distribution, and discriminant analysis. Probit regression, although assuming
a similar shape to the logistic function for the regression relationship
between $X\beta$ and $\mathrm{Prob}\{Y = 1\}$, involves more cumbersome
calculations, and there is no natural interpretation of its regression
parameters. In the past, discriminant analysis has been the predominant method
since it is the simplest computationally. However, it makes more assumptions
than logistic regression. The model used in discriminant analysis is stated in
terms of the distribution of X given the outcome group Y, even though one is
seldom interested in the distribution of the predictors per se. The
discriminant model has to be inverted using Bayes’ rule to derive the quantity
of primary interest, $\mathrm{Prob}\{Y = 1\}$. By contrast, the logistic model
is a direct probability model since it is stated in terms of
$\mathrm{Prob}\{Y = 1 \mid X\}$. Since the distribution of a binary random
variable Y is completely defined by the true probability that Y = 1 and since
the model makes no assumption about the distribution of the predictors, the
logistic model makes no distributional assumptions whatsoever.


10.1.1 Model Assumptions and Interpretation of Parameters

Since the logistic model is a direct probability model, its only assumptions
relate to the form of the regression equation. Regression assumptions are
verifiable, unlike the assumption of multivariate normality made by
discriminant analysis. The logistic model assumptions are most easily
understood by transforming $\mathrm{Prob}\{Y = 1\}$ to make a model that is
linear in $X\beta$:

$$\mathrm{logit}\{Y = 1 \mid X\} = \mathrm{logit}(P) = \log[P/(1 - P)] = X\beta, \tag{10.6}$$

where $P = \mathrm{Prob}\{Y = 1 \mid X\}$. Thus the model is a linear
regression model in the log odds that Y = 1 since $\mathrm{logit}(P)$ is a
weighted sum of the Xs. If all effects are additive (i.e., no interactions are
present), the model assumes that for every predictor $X_j$,

$$\mathrm{logit}\{Y = 1 \mid X\} = \beta_0 + \beta_1 X_1 + \ldots + \beta_j X_j + \ldots + \beta_k X_k = \beta_j X_j + C, \tag{10.7}$$

where if all other factors are held constant, C is a constant given by

$$C = \beta_0 + \beta_1 X_1 + \ldots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \ldots + \beta_k X_k. \tag{10.8}$$

The parameter $\beta_j$ is then the change in the log odds per unit change in
$X_j$ if $X_j$ represents a single factor that is linear and does not interact
with other factors and if all other factors are held constant. Instead of
writing this relationship in terms of log odds, it could just as easily be
written in terms of the odds that Y = 1:


$$\mathrm{odds}\{Y = 1 \mid X\} = \exp(X\beta), \tag{10.9}$$

and if all factors other than $X_j$ are held constant,

$$\mathrm{odds}\{Y = 1 \mid X\} = \exp(\beta_j X_j + C) = \exp(\beta_j X_j)\exp(C). \tag{10.10}$$

The regression parameters can also be written in terms of odds ratios. The
odds that Y = 1 when $X_j$ is increased by d, divided by the odds at $X_j$, is

$$\frac{\mathrm{odds}\{Y = 1 \mid X_1, X_2, \ldots, X_j + d, \ldots, X_k\}}{\mathrm{odds}\{Y = 1 \mid X_1, X_2, \ldots, X_j, \ldots, X_k\}}
= \frac{\exp[\beta_j (X_j + d)]\exp(C)}{\exp(\beta_j X_j)\exp(C)}
= \exp[\beta_j X_j + \beta_j d - \beta_j X_j] = \exp(\beta_j d). \tag{10.11}$$

Thus the effect of increasing $X_j$ by d is to increase the odds that Y = 1 by
a factor of $\exp(\beta_j d)$, or to increase the log odds that Y = 1 by an
increment of $\beta_j d$. In general, the ratio of the odds of response for an
individual with predictor variable values $X^*$ compared with an individual
with predictors X is

$$X^* : X \text{ odds ratio} = \exp(X^*\beta)/\exp(X\beta) = \exp[(X^* - X)\beta]. \tag{10.12}$$
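
The odds ratio identities in Equations 10.11 and 10.12 can be checked numerically. The sketch below uses made-up coefficient values (an intercept and two slopes) purely for illustration; none of the numbers or helper names come from the text.

```r
# Hypothetical coefficients: intercept, beta_1, beta_2 (illustrative values only)
beta <- c(-1.0, 0.7, 0.05)

# Odds that Y = 1 for given X1 and X2 (Equation 10.9)
odds <- function(x1, x2) exp(beta[1] + beta[2] * x1 + beta[3] * x2)

# Odds ratio for increasing X2 by d = 10 with X1 held fixed (Equation 10.11)
d <- 10
or_direct  <- odds(1, 40 + d) / odds(1, 40)
or_formula <- exp(beta[3] * d)

# The ratio does not depend on the starting value of X2 or on the value of X1
or_other <- odds(0, 65 + d) / odds(0, 65)

# General comparison of two covariate vectors (Equation 10.12)
x_star <- c(1, 1, 50)   # leading 1 multiplies the intercept
x_ref  <- c(1, 0, 40)
or_general <- exp(sum((x_star - x_ref) * beta))

c(or_direct, or_formula, or_other, or_general)
```

The first three values coincide because the additive model has no interaction; the last is the odds ratio for a subject with $X_1 = 1, X_2 = 50$ relative to one with $X_1 = 0, X_2 = 40$.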

Now consider some special cases of the logistic multiple regression model.
If there is only one predictor X and that predictor is binary, the model can
be written

$$\mathrm{logit}\{Y = 1 \mid X = 0\} = \beta_0$$
$$\mathrm{logit}\{Y = 1 \mid X = 1\} = \beta_0 + \beta_1. \tag{10.13}$$

Here $\beta_0$ is the log odds of Y = 1 when X = 0. By subtracting the two
equations above, it can be seen that $\beta_1$ is the difference in the log
odds when X = 1 as compared with X = 0, which is equivalent to the log of the
ratio of the odds when X = 1 compared with the odds when X = 0. The
quantity $\exp(\beta_1)$ is the odds ratio for X = 1 compared with X = 0.
Letting $P^0 = \mathrm{Prob}\{Y = 1 \mid X = 0\}$ and
$P^1 = \mathrm{Prob}\{Y = 1 \mid X = 1\}$, the regression parameters are
interpreted by

$$\begin{aligned}
\beta_0 &= \mathrm{logit}(P^0) = \log[P^0/(1 - P^0)] \\
\beta_1 &= \mathrm{logit}(P^1) - \mathrm{logit}(P^0) \\
        &= \log[P^1/(1 - P^1)] - \log[P^0/(1 - P^0)] \\
        &= \log\{[P^1/(1 - P^1)]/[P^0/(1 - P^0)]\}.
\end{aligned} \tag{10.14}$$

Since there are only two quantities to model and two free parameters,
there is no way that this two-sample model can’t fit; the model in this case
is essentially fitting two cell proportions. Similarly, if there are $g - 1$
dummy indicator Xs representing g groups, the ANOVA-type logistic model must
always fit.

If there is one continuous predictor X, the model is

$$\mathrm{logit}\{Y = 1 \mid X\} = \beta_0 + \beta_1 X, \tag{10.15}$$

and without further modification (e.g., taking log transformation of the
predictor), the model assumes a straight line in the log odds, or that an
increase in X by one unit increases the odds by a factor of $\exp(\beta_1)$.

Now consider the simplest analysis of covariance model in which there are
two treatments (indicated by $X_1 = 0$ or 1) and one continuous covariable
($X_2$). The simplest logistic model for this setup is

$$\mathrm{logit}\{Y = 1 \mid X\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2, \tag{10.16}$$

which can be written also as

$$\mathrm{logit}\{Y = 1 \mid X_1 = 0, X_2\} = \beta_0 + \beta_2 X_2$$
$$\mathrm{logit}\{Y = 1 \mid X_1 = 1, X_2\} = \beta_0 + \beta_1 + \beta_2 X_2. \tag{10.17}$$

The $X_1 = 1 : X_1 = 0$ odds ratio is $\exp(\beta_1)$, independent of $X_2$.
The odds ratio for a one-unit increase in $X_2$ is $\exp(\beta_2)$, independent
of $X_1$.

This model, with no term for a possible interaction between treatment
and covariable, assumes that for each treatment the relationship between $X_2$
and log odds is linear, and that the lines have equal slope; that is, they are
parallel. Assuming linearity in $X_2$, the only way that this model can fail is
for the two slopes to differ. Thus, the only assumptions that need verification
are linearity and lack of interaction between $X_1$ and $X_2$.

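To adapt the model to allow or test for interaction, we write

$$\mathrm{logit}\{Y = 1 \mid X\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3, \tag{10.18}$$

where the derived variable $X_3$ is defined to be $X_1 X_2$. The test for lack
of interaction (equal slopes) is $H_0 : \beta_3 = 0$. The model can be
amplified as

$$\begin{aligned}
\mathrm{logit}\{Y = 1 \mid X_1 = 0, X_2\} &= \beta_0 + \beta_2 X_2 \\
\mathrm{logit}\{Y = 1 \mid X_1 = 1, X_2\} &= \beta_0 + \beta_1 + \beta_2 X_2 + \beta_3 X_2 \\
                                          &= \beta_0' + \beta_2' X_2,
\end{aligned} \tag{10.19}$$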

Table 10.1 Effect of an odds ratio of two on various risks

  Without Risk Factor        With Risk Factor
  Probability    Odds        Odds    Probability
      .2          .25          .5       .33
      .5         1            2         .67
      .8         4            8         .89
      .9         9           18         .95
      .98       49           98         .99

where $\beta_0' = \beta_0 + \beta_1$ and $\beta_2' = \beta_2 + \beta_3$. The
model with interaction is therefore equivalent to fitting two separate logistic
models with $X_2$ as the only predictor, one model for each treatment group.
Here the $X_1 = 1 : X_1 = 0$ odds ratio is $\exp(\beta_1 + \beta_3 X_2)$.
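
The equivalence between the interaction model and two separate fits can be illustrated on simulated data. In this sketch the sample size, the true coefficients, and all variable names are assumptions chosen for illustration, and base R's `glm` is used rather than any particular modeling package.

```r
set.seed(1)
n  <- 400
x1 <- rbinom(n, 1, 0.5)          # treatment indicator
x2 <- rnorm(n, 50, 10)           # continuous covariable
# Assumed true model with unequal slopes (values chosen for illustration)
lp <- -5 + 0.5 * x1 + 0.08 * x2 + 0.04 * x1 * x2
y  <- rbinom(n, 1, plogis(lp))

# Model 10.18: the x1:x2 term plays the role of X3 = X1 * X2
full <- glm(y ~ x1 * x2, family = binomial)
b    <- coef(full)

# Separate fits within each treatment group
f0 <- glm(y ~ x2, family = binomial, subset = x1 == 0)
f1 <- glm(y ~ x2, family = binomial, subset = x1 == 1)

# Equation 10.19: group 0 has (beta0, beta2);
# group 1 has (beta0 + beta1, beta2 + beta3)
rbind(from_full = c(b["(Intercept)"], b["x2"]), separate = coef(f0))
rbind(from_full = c(b["(Intercept)"] + b["x1"], b["x2"] + b["x1:x2"]),
      separate  = coef(f1))

# The X1 = 1 : X1 = 0 odds ratio now depends on X2
or_at <- function(x2val) exp(b["x1"] + b["x1:x2"] * x2val)
or_at(c(40, 60))
```

The two rows of each comparison should agree up to the convergence tolerance of the fitting algorithm, while the standard errors from the pooled fit are what make the test of $\beta_3 = 0$ available.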


10.1.2 Odds Ratio, Risk Ratio, and Risk Difference

As discussed above, the logistic model quantifies the effect of a predictor in
terms of an odds ratio or log odds ratio. An odds ratio is a natural
description of an effect in a probability model since an odds ratio can be
constant. For example, suppose that a given risk factor doubles the odds of
disease. Table 10.1 shows the effect of the risk factor for various levels of
initial risk.

Since odds have an unlimited range, any positive odds ratio will still yield
a valid probability. If one attempted to describe an effect by a risk ratio, the
effect can only occur over a limited range of risk (probability). For example, a
risk ratio of 2 can only apply to risks below .5; above that point the risk ratio
must diminish. (Risk ratios are similar to odds ratios if the risk is small.)
Risk differences have the same difficulty; the risk difference cannot be
constant and must depend on the initial risk. Odds ratios, on the other hand,
can describe an effect over the entire range of risk. An odds ratio can, for
example, describe the effect of a treatment independently of covariables
affecting risk. Figure 10.2 depicts the relationship between risk of a subject
without the risk factor and the increase in risk for a variety of relative
increases (odds ratios). It demonstrates how absolute risk increase is a
function of the baseline risk. Risk increase will also be a function of factors
that interact with the risk factor, that is, factors that modify its relative
effect. Once a model is developed for estimating
$\mathrm{Prob}\{Y = 1 \mid X\}$, this model can easily be used to estimate the
absolute risk increase as a function of baseline risk factors as well as
interacting factors. Let $X_1$ be a binary risk factor and let
$A = \{X_2, \ldots, X_p\}$ be the other factors (which for convenience we
assume do not interact with $X_1$). Then the estimate of
$\mathrm{Prob}\{Y = 1 \mid X_1 = 1, A\} - \mathrm{Prob}\{Y = 1 \mid X_1 = 0, A\}$ is

Fig. 10.2 Absolute benefit as a function of risk of the event in a control
subject and the relative effect (odds ratio) of the risk factor. The odds
ratios are given for each curve.


Table 10.2 Example binary response data

  Females  Age:       37 39 39 42 47 48 48 52 53 55 56 57 58 58 60 64 65 68 68 70
           Response:   0  0  0  0  0  0  1  0  0  0  0  0  0  1  0  0  1  1  1  1
  Males    Age:       34 38 40 40 41 43 43 43 44 46 47 48 48 50 50 52 55 60 61 61
           Response:   1  1  0  0  0  1  1  1  0  0  1  1  1  0  1  1  1  1  1  1


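$$\begin{aligned}
&\frac{1}{1 + \exp\{-[\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p]\}}
 - \frac{1}{1 + \exp\{-[\hat{\beta}_0 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p]\}} \\
&\quad = \frac{1}{1 + \left(\frac{1 - \hat{R}}{\hat{R}}\right)\exp(-\hat{\beta}_1)} - \hat{R},
\end{aligned} \tag{10.20}$$

where $\hat{R}$ is the estimate of the baseline risk,
$\mathrm{Prob}\{Y = 1 \mid X_1 = 0\}$. The risk difference estimate can be
plotted against $\hat{R}$ or against levels of variables in A to display
absolute risk increase against overall risk (Figure 10.2) or against
specific subject characteristics.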
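
The dependence of the absolute risk increase on baseline risk can be made concrete with a short R sketch. It recomputes the pattern of Table 10.1 for an odds ratio of 2 and then evaluates the risk difference of Equation 10.20 over a grid of baseline risks; the function names are illustrative, not from the text.

```r
# Risk with the factor present, given baseline risk R and an odds ratio
risk_with_factor <- function(R, or) {
  odds <- or * R / (1 - R)
  odds / (1 + odds)
}

# Pattern of Table 10.1 for an odds ratio of 2
R <- c(.2, .5, .8, .9, .98)
data.frame(risk_without = R,
           odds_without = R / (1 - R),
           odds_with    = 2 * R / (1 - R),
           risk_with    = round(risk_with_factor(R, 2), 2))

# Equation 10.20: risk difference as a function of baseline risk
risk_diff <- function(R, beta1) 1 / (1 + ((1 - R) / R) * exp(-beta1)) - R

Rgrid <- seq(0.01, 0.99, by = 0.01)
diffs <- risk_diff(Rgrid, log(2))
Rgrid[which.max(diffs)]   # baseline risk at which the absolute increase peaks
```

Plotting `diffs` against `Rgrid` for several odds ratios would give curves of the kind described for Figure 10.2: the absolute increase is small near risks of 0 or 1 and largest at intermediate baseline risk.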


10.1.3  Detailed  Example 

Consider the data in Table 10.2. A graph of the data, along with a fitted
logistic model (described later), appears in Figure 10.3. The graph also
displays proportions of responses obtained by stratifying the data by sex and
age group (< 45, 45-54, ≥ 55). The age points on the abscissa for these
groups are the overall mean ages in the three age intervals (40.2, 49.1, and
61.1, respectively).

Descriptive statistics for assessing the association between sex and
response, age group and response, and age group and response stratified by
sex are found below. Corresponding fitted logistic models, with sex coded as
0 = female, 1 = male are also given. Models were fitted first with sex as the
only predictor, then with age as the (continuous) predictor, then with sex and
age simultaneously. First consider the relationship between sex and response,
ignoring the effect of age.


Fig. 10.3 Data, subgroup proportions, and fitted logistic model, with 0.95
pointwise confidence bands (vertical axis: Pr[response])


                  response
  sex            0         1      Total    Odds/Log
  (Frequency, Row Pct)
  F             14         6        20     6/14 = .429
              70.00     30.00              -.847
  M              6        14        20     14/6 = 2.33
              30.00     70.00               .847
  Total         20        20        40

M:F odds ratio = (14/6)/(6/14) = 5.44, log = 1.695

Statistics for sex × response

  Statistic                     d.f.    Value      P
  $\chi^2$                       1      6.400    0.011
  Likelihood Ratio $\chi^2$      1      6.583    0.010

  Parameter    Estimate    Std Err    Wald $\chi^2$      P
  $\beta_0$    -0.8473     0.4880       3.0152
  $\beta_1$     1.6946     0.6901       6.0305       0.0141

Note that the estimate of $\beta_0$, $\hat{\beta}_0$, is the log odds for
females and that $\hat{\beta}_1$ is the log odds (M:F) ratio.
$\hat{\beta}_0 + \hat{\beta}_1 = .847$, the log odds for males. The likelihood
ratio test for $H_0$ : no effect of sex on probability of response is obtained
as follows.

  Log likelihood ($\beta_1 = 0$)      : -27.727
  Log likelihood (max)                : -24.435
  LR $\chi^2$ ($H_0 : \beta_1 = 0$)   : -2(-27.727 - -24.435) = 6.584.

(Note the agreement of the LR $\chi^2$ with the contingency table likelihood
ratio $\chi^2$, and compare 6.584 with the Wald statistic 6.03.)
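
Because the sex-only model simply fits two proportions, its likelihood ratio test can be reproduced from the four cell counts alone. The following R sketch does this under that assumption; the helper `loglik` is illustrative and the results should match the values quoted above up to rounding.

```r
# Counts from the sex x response table: responders and totals
s_f <- 6;  n_f <- 20      # females
s_m <- 14; n_m <- 20      # males

# Binomial log likelihood for s responses out of n at probability p
loglik <- function(s, n, p) s * log(p) + (n - s) * log(1 - p)

# Null model: one common probability for both sexes
ll_null <- loglik(s_f + s_m, n_f + n_m, (s_f + s_m) / (n_f + n_m))
# Two-sample model: each sex has its own proportion
ll_max  <- loglik(s_f, n_f, s_f / n_f) + loglik(s_m, n_m, s_m / n_m)

lr <- -2 * (ll_null - ll_max)
p  <- pchisq(lr, df = 1, lower.tail = FALSE)

# Coefficients implied by the two proportions (Equation 10.14)
b0 <- qlogis(s_f / n_f)
b1 <- qlogis(s_m / n_m) - b0

round(c(ll_null = ll_null, ll_max = ll_max, LR = lr, P = p, b0 = b0, b1 = b1), 4)
```

No iteration is needed here because the maximum likelihood estimates are the observed proportions; the Wald statistic, by contrast, also requires the standard error of $\hat{\beta}_1$.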

Next,  consider  the  relationship  between  age  and  response,  ignoring  sex. 


                  response
  age            0        1      Total    Odds/Log
  (Frequency, Row Pct)
  <45            8        5        13     5/8 = .625
               61.5     38.4              -.47
  45-54          6        6        12     6/6 = 1
               50.0     50.0               0
  55+            6        9        15     9/6 = 1.5
               40.0     60.0               .405
  Total         20       20        40

55+ : <45 odds ratio = (9/6)/(5/8) = 2.4, log = .875

  Parameter    Estimate    Std Err    Wald $\chi^2$      P
  $\beta_0$    -2.7338     1.8375       2.2134       0.1368
  $\beta_1$     0.0540     0.0358       2.2763       0.1314

The estimate of $\beta_1$ is in rough agreement with that obtained from the
frequency table. The 55+ : <45 log odds ratio is .875, and since the respective
mean ages in the 55+ and <45 age groups are 61.1 and 40.2, an estimate of
the log odds ratio increase per year is .875/(61.1 - 40.2) = .875/20.9 = .042.

The likelihood ratio test for $H_0$ : no association between age and response
is obtained as follows.

  Log likelihood ($\beta_1 = 0$)      : -27.727
  Log likelihood (max)                : -26.511
  LR $\chi^2$ ($H_0 : \beta_1 = 0$)   : -2(-27.727 - -26.511) = 2.432.

(Compare 2.432 with the Wald statistic 2.28.)

Next  we  consider  the  simultaneous  association  of  age  and  sex  with 
response. 

sex=F
                  response
  age            0        1      Total
  (Frequency, Row Pct)
  <45            4        0        4
              100.0      0.0
  45-54          4        1        5
               80.0     20.0
  55+            6        5       11
               54.6     45.4
  Total         14        6       20

sex=M
                  response
  age            0        1      Total
  (Frequency, Row Pct)
  <45            4        5        9
               44.4     55.6
  45-54          2        5        7
               28.6     71.4
  55+            0        4        4
                0.0    100.0
  Total          6       14       20

A  logistic  model  for  relating  sex  and  age  simultaneously  to  response  is 
given  below. 


  Parameter          Estimate    Std Err    Wald $\chi^2$      P
  $\beta_0$          -9.8429     3.6758       7.1706       0.0074
  $\beta_1$ (sex)     3.4898     1.1992       8.4693       0.0036
  $\beta_2$ (age)     0.1581     0.0616       6.5756       0.0103


Likelihood ratio tests are obtained from the information below.

  Log likelihood ($\beta_1 = 0, \beta_2 = 0$)         -27.727
  Log likelihood (max)                                -19.458
  Log likelihood ($\beta_1 = 0$)                      -26.511
  Log likelihood ($\beta_2 = 0$)                      -24.435
  LR $\chi^2$ ($H_0 : \beta_1 = \beta_2 = 0$)         -2(-27.727 - -19.458) = 16.538
  LR $\chi^2$ ($H_0 : \beta_1 = 0$)  sex | age        -2(-26.511 - -19.458) = 14.106
  LR $\chi^2$ ($H_0 : \beta_2 = 0$)  age | sex        -2(-24.435 - -19.458) = 9.954.


The 14.1 should be compared with the Wald statistic of 8.47, and 9.954
should be compared with 6.58. The fitted logistic model is plotted separately
for females and males in Figure 10.3. The fitted model is

$$\mathrm{logit}\{\text{Response} = 1 \mid \text{sex}, \text{age}\} = -9.84 + 3.49 \times \text{sex} + .158 \times \text{age}, \tag{10.21}$$

where as before sex = 0 for females, 1 for males. For example, for a
40-year-old female, the predicted logit is -9.84 + .158(40) = -3.52. The
predicted probability of a response is 1/[1 + exp(3.52)] = .029. For a
40-year-old male, the predicted logit is -9.84 + 3.49 + .158(40) = -.03, with a
probability of .492.
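
The fits in this example can be reproduced from the data in Table 10.2. The sketch below uses base R's `glm` rather than any specialized package, with the data transcribed by hand from the table; the object names are illustrative. The output should agree with the estimates and likelihood ratio statistics reported above up to rounding.

```r
# Data transcribed from Table 10.2 (sex coded 0 = female, 1 = male)
age_f <- c(37,39,39,42,47,48,48,52,53,55,56,57,58,58,60,64,65,68,68,70)
y_f   <- c(0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,1,1)
age_m <- c(34,38,40,40,41,43,43,43,44,46,47,48,48,50,50,52,55,60,61,61)
y_m   <- c(1,1,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1,1,1,1)
d <- data.frame(age = c(age_f, age_m),
                sex = rep(0:1, each = 20),
                response = c(y_f, y_m))

fit      <- glm(response ~ sex + age, family = binomial, data = d)
fit_age  <- glm(response ~ age,       family = binomial, data = d)
fit_sex  <- glm(response ~ sex,       family = binomial, data = d)
fit_null <- glm(response ~ 1,         family = binomial, data = d)

# Estimates, standard errors, and Wald z (z^2 is the Wald chi-square)
summary(fit)$coefficients

# Likelihood ratio chi-square statistics from nested log likelihoods
lr <- function(small, big) as.numeric(-2 * (logLik(small) - logLik(big)))
c(global        = lr(fit_null, fit),
  sex_given_age = lr(fit_age,  fit),
  age_given_sex = lr(fit_sex,  fit))

# Predicted probabilities for a 40-year-old female and male
predict(fit, newdata = data.frame(sex = 0:1, age = 40), type = "response")
```

Comparing each reduced model against the full model in this way is what distinguishes the adjusted tests (sex given age, age given sex) from the unadjusted tests shown earlier.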


10.1.4  Design  Formulations 

The logistic multiple regression model can incorporate the same designs as
can ordinary linear regression. An analysis of variance (ANOVA) model for
a treatment with k levels can be formulated with $k - 1$ dummy variables.
This logistic model is equivalent to a $2 \times k$ contingency table. An
analysis of covariance logistic model is simply an ANOVA model augmented with
covariables used for adjustment.

One unique design that is interesting to consider in the context of logistic
models is a simultaneous comparison of multiple factors between two groups.
Suppose, for example, that in a randomized trial with two treatments one
wished to test whether any of 10 baseline characteristics are mal-distributed
between the two groups. If the 10 factors are continuous, one could perform a
two-sample Wilcoxon-Mann-Whitney test or a t-test for each factor (if each
is normally distributed). However, this procedure would result in multiple
comparison problems and would also not be able to detect the combined
effect of small differences across all the factors. A better procedure would
be a multivariate test. The Hotelling $T^2$ test is designed for just this
situation. It is a k-variable extension of the one-variable unpaired t-test.
The $T^2$ test, like discriminant analysis, assumes multivariate normality of
the k factors. This assumption is especially tenuous when some of the factors
are polytomous. A better alternative is the global test of no regression from
the logistic model. This test is valid because it can be shown that
$H_0$ : mean X is the same for both groups (= $H_0$ : mean X does not depend on
group = $H_0$ : mean X | group = constant) is true if and only if
$H_0 : \mathrm{Prob}\{\text{group} \mid X\}$ = constant. Thus k factors can be
tested simultaneously for differences between the two groups using the binary
logistic model, which has far fewer assumptions than does the Hotelling $T^2$
test. The logistic global test of no regression (with k d.f.) would be expected
to have greater power if there is non-normality. Since the logistic model makes
no assumption regarding the distribution of the descriptor variables, it can
easily test for simultaneous group differences involving a mixture of
continuous, binary, and nominal variables. In observational studies, such
models for treatment received or exposure (propensity score models) hold
great promise for adjusting for confounding [117, 380, 526, 530, 531].
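
The global test of no regression described above can be sketched in R on simulated data. Everything in this example is an assumption made for illustration: the sample size, the ten hypothetical baseline characteristics (eight continuous, one binary, one three-level nominal), and the use of base R's `glm` with a likelihood ratio test.

```r
set.seed(2)
n <- 200
group <- rbinom(n, 1, 0.5)                       # randomized treatment indicator

# Ten hypothetical baseline characteristics of mixed type, unrelated to group
base <- data.frame(matrix(rnorm(n * 8), n, 8))   # eight continuous (X1..X8)
base$smoker <- rbinom(n, 1, 0.3)                 # one binary
base$race   <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # one nominal

fit  <- glm(group ~ ., family = binomial, data = cbind(group = group, base))
null <- glm(group ~ 1, family = binomial)

# Global LR test of no regression: all slopes zero
lr <- as.numeric(-2 * (logLik(null) - logLik(fit)))
df <- length(coef(fit)) - 1       # 8 + 1 + 2 dummy variables = 11 d.f.
p  <- pchisq(lr, df, lower.tail = FALSE)
c(LR = lr, df = df, P = p)
```

Note that the degrees of freedom equal the number of regression parameters rather than the number of characteristics, since a nominal variable with three levels contributes two dummy variables.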

O’Brien has developed a general test for comparing group 1 with
group 2 for a single measurement. His test detects location and scale
differences by fitting a logistic model for Prob{Group 2} using X and $X^2$ as
predictors.

For a randomized study where adjustment for confounding is seldom
necessary, adjusting for covariables using a binary logistic model results in
increases in standard errors of regression coefficients [527]. This is the
opposite of what happens in linear regression where there is an unknown
variance parameter that is estimated using the residual squared error.
Fortunately, adjusting for covariables using logistic regression, by accounting
for subject heterogeneity, will result in larger regression coefficients even
for a randomized treatment variable. The increase in estimated regression
coefficients more than offsets the increase in standard error
[284, 285, 527, 588].


10.2  Estimation 


10.2.1  Maximum  Likelihood  Estimates 


The parameters in the logistic regression model are estimated using the
maximum likelihood (ML) method. The method is based on the same principles as
the one-sample proportion example described in Section 9.1. The difference
is that the general logistic model is not a single sample or a two-sample
problem. The probability of response for the ith subject depends on a
particular set of predictors $X_i$, and in fact the list of predictors may not
be the same for any two subjects. Denoting the response and probability of
response of the ith subject by $Y_i$ and $P_i$, respectively, the model states
that

$$P_i = \mathrm{Prob}\{Y_i = 1 \mid X_i\} = [1 + \exp(-X_i\beta)]^{-1}. \tag{10.22}$$

The likelihood of an observed response $Y_i$ given predictors $X_i$ and the
unknown parameters $\beta$ is

$$P_i^{Y_i}[1 - P_i]^{1 - Y_i}. \tag{10.23}$$

The joint likelihood of all responses $Y_1, Y_2, \ldots, Y_n$ is the product of
these likelihoods for $i = 1, \ldots, n$. The likelihood and log likelihood
functions are rewritten by using the definition of $P_i$ above to allow them to
be recognized as a function of the unknown parameters $\beta$. Except in simple
special cases (such as the k-sample problem in which all Xs are dummy
variables), the ML estimates (MLEs) of $\beta$ cannot be written explicitly.
The Newton-Raphson method described in Section 9.4 is usually used to solve
iteratively for the list of values $\beta$ that maximize the log likelihood.
The MLEs are denoted by $\hat{\beta}$. The inverse of the estimated observed
information matrix is taken as the estimate of the variance-covariance matrix
of $\hat{\beta}$.
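
The iterative solution can be sketched in a few lines of R. This is a bare-bones Newton-Raphson loop written for illustration, not the algorithm as implemented in any particular package; the function name `newton_logistic`, the simulated data, and the assumed true coefficients are all hypothetical. It is compared against base R's `glm`.

```r
newton_logistic <- function(X, y, tol = 1e-10, maxit = 25) {
  beta <- rep(0, ncol(X))                       # start at beta = 0
  for (iter in seq_len(maxit)) {
    p     <- plogis(drop(X %*% beta))           # current P_i
    score <- crossprod(X, y - p)                # first derivative of log L
    info  <- crossprod(X, X * (p * (1 - p)))    # information matrix X'WX
    step  <- drop(solve(info, score))
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  p    <- plogis(drop(X %*% beta))
  info <- crossprod(X, X * (p * (1 - p)))
  # Inverse information estimates the variance-covariance matrix
  list(coef = beta, vcov = solve(info), iterations = iter)
}

set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))   # assumed true model
X  <- cbind(1, x1, x2)                                   # design matrix

nr  <- newton_logistic(X, y)
fit <- glm(y ~ x1 + x2, family = binomial)
cbind(newton = nr$coef, glm = coef(fit))
cbind(newton_se = sqrt(diag(nr$vcov)), glm_se = sqrt(diag(vcov(fit))))
```

For the logistic model the observed and expected information coincide, which is why this loop and the iteratively reweighted least squares used by `glm` arrive at the same estimates and standard errors.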

Under $H_0 : \beta_1 = \beta_2 = \ldots = \beta_k = 0$, the intercept
parameter $\beta_0$ can be estimated explicitly and the log likelihood under
this global null hypothesis can be computed explicitly. Under the global null
hypothesis, $P_i = P = [1 + \exp(-\beta_0)]^{-1}$ and the MLE of P is
$\hat{P} = s/n$ where s is the number of responses and n is the sample size.
The MLE of $\beta_0$ is $\hat{\beta}_0 = \mathrm{logit}(\hat{P})$. The log
likelihood under this null hypothesis is

$$\begin{aligned}
& s\log(\hat{P}) + (n - s)\log(1 - \hat{P}) \\
&\quad = s\log(s/n) + (n - s)\log[(n - s)/n] \\
&\quad = s\log s + (n - s)\log(n - s) - n\log(n).
\end{aligned} \tag{10.24}$$
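
Equation 10.24 can be verified against an intercept-only fit. The sketch below uses s = 20 responses out of n = 40, matching the example data of this chapter, and base R's `glm`; the variable names are illustrative.

```r
s <- 20; n <- 40                       # responses and sample size
y <- rep(c(1, 0), c(s, n - s))

P_hat  <- s / n
b0_hat <- qlogis(P_hat)                # MLE of the intercept, logit(s/n)

# Three equivalent forms of Equation 10.24
ll_a <- s * log(P_hat) + (n - s) * log(1 - P_hat)
ll_b <- s * log(s / n) + (n - s) * log((n - s) / n)
ll_c <- s * log(s) + (n - s) * log(n - s) - n * log(n)

# Iterative fit of the intercept-only model for comparison
fit0 <- glm(y ~ 1, family = binomial)
c(closed_form = ll_a,
  glm         = as.numeric(logLik(fit0)),
  intercept   = unname(coef(fit0)))
```

This closed form is the baseline log likelihood against which the global likelihood ratio statistic of any fitted model is computed.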


10.2.2  Estimation  of  Odds  Ratios  and  Probabilities 


Once $\beta$ is estimated, one can estimate any log odds, odds, or odds ratios.
The MLE of the $X_j + 1 : X_j$ log odds ratio is $\hat{\beta}_j$, and the
estimate of the $X_j + d : X_j$ log odds ratio is $\hat{\beta}_j d$, all other
predictors remaining constant (assuming the absence of interactions and
nonlinearities involving $X_j$). For large enough samples, the MLEs are
normally distributed with variances that are consistently estimated from the
estimated variance-covariance matrix. Letting z denote the $1 - \alpha/2$
critical value of the standard normal distribution, a two-sided $1 - \alpha$
confidence interval for the log odds ratio for a one-unit increase in $X_j$ is
$[\hat{\beta}_j - zs, \hat{\beta}_j + zs]$, where s is the estimated standard
error of $\hat{\beta}_j$. (Note that for $\alpha = .05$, i.e., for a 95%
confidence interval, z = 1.96.)

A theorem in statistics states that the MLE of a function of a parameter
is that same function of the MLE of the parameter. Thus the MLE of the
$X_j + 1 : X_j$ odds ratio is $\exp(\hat{\beta}_j)$. Also, if a $1 - \alpha$
confidence interval of a parameter $\beta$ is $[c, d]$ and $f(u)$ is a
one-to-one function, a $1 - \alpha$ confidence interval of $f(\beta)$ is
$[f(c), f(d)]$. Thus a $1 - \alpha$ confidence interval for the
$X_j + 1 : X_j$ odds ratio is $\exp[\hat{\beta}_j \pm zs]$. Note that while the
confidence interval for $\beta_j$ is symmetric about $\hat{\beta}_j$, the
confidence interval for $\exp(\beta_j)$ is not. By the same theorem just used,
the MLE of $P_i = \mathrm{Prob}\{Y_i = 1 \mid X_i\}$ is

$$\hat{P}_i = [1 + \exp(-X_i\hat{\beta})]^{-1}. \tag{10.25}$$


A confidence interval for $P_i$ could be derived by computing the standard
error of $\hat{P}_i$, yielding a symmetric confidence interval. However, such
an interval would have the disadvantage that its endpoints could fall below
zero or exceed one. A better approach uses the fact that for large samples
$X\hat{\beta}$ is approximately normally distributed. An estimate of the
variance of $X\hat{\beta}$ in matrix notation is $XVX'$ where V is the
estimated variance-covariance matrix of $\hat{\beta}$ (see Equation 9.51). This
variance is the sum of all variances and covariances of $\hat{\beta}$ weighted
by squares and products of the predictors. The estimated standard error of
$X\hat{\beta}$, s, is the square root of this variance estimate. A
$1 - \alpha$ confidence interval for $P_i$ is then

$$\{1 + \exp[-(X_i\hat{\beta} \pm zs)]\}^{-1}. \tag{10.26}$$
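
The interval constructions of this section can be sketched in R as follows. The data are simulated under an assumed model, the variable names are illustrative, and base R's `glm`, `vcov`, and `predict` are used. The odds ratio is shown for a d-unit increase, for which the log odds ratio is $\hat{\beta}_j d$ with standard error $d \times s$.

```r
set.seed(4)
n   <- 250
age <- round(runif(n, 30, 70))
sex <- rbinom(n, 1, 0.5)
y   <- rbinom(n, 1, plogis(-4 + 0.06 * age + 0.8 * sex))   # assumed true model
fit <- glm(y ~ age + sex, family = binomial)

z <- qnorm(0.975)                       # about 1.96 for a 0.95 interval
b <- coef(fit)
V <- vcov(fit)
k <- c(estimate = 0, lower = -1, upper = 1)

# Odds ratio for a 10-year increase in age: exp(d * beta_hat +/- z * d * s)
d     <- 10
se    <- sqrt(V["age", "age"])
or_ci <- exp(d * unname(b["age"]) + k * z * d * se)

# Probability for a 50-year-old with sex = 1 via Equation 10.26
x    <- c(1, 50, 1)                      # intercept, age, sex
lp   <- sum(x * b)
s    <- sqrt(drop(t(x) %*% V %*% x))     # standard error of X beta_hat
p_ci <- plogis(lp + k * z * s)

# Same standard error from predict(), as a cross-check
pr <- predict(fit, data.frame(age = 50, sex = 1), type = "link", se.fit = TRUE)
or_ci
p_ci
c(manual_se = s, predict_se = unname(pr$se.fit))
```

Both intervals are asymmetric about their point estimates because they are built on the linear predictor scale and then transformed, which also keeps the probability limits inside (0, 1).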


10.2.3  Minimum  Sample  Size  Requirement 


Suppose there were no covariates, so that the only parameter in the model is
the intercept. What is the sample size required to allow the estimate of the
intercept to be precise enough so that the predicted probability is within 0.1
of the true probability with 0.95 confidence, when the true intercept is in the
neighborhood of zero? The answer is $n = 96$. What if there were one covariate,
and it was binary with a prevalence of 1/2? One would need 96 subjects with
$X = 0$ and 96 with $X = 1$ to have an upper bound on the margin of error
for estimating $\mathrm{Prob}\{Y = 1 | X = x\}$ not exceed 0.1 for either value of $x$.^a
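The figure of 96 follows from the usual margin-of-error formula for a proportion, $n = (z/\delta)^2 \theta(1 - \theta)$, evaluated at the worst case $\theta = 1/2$. A minimal sketch (the function name `margin_n` is hypothetical):

```r
# Sample size needed so that the margin of error for estimating a
# probability theta is delta at the given confidence level
margin_n <- function(delta, theta = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  (z / delta)^2 * theta * (1 - theta)
}

n_worst <- margin_n(0.1)   # worst case: theta = 1/2, i.e., intercept = 0
round(n_worst)             # subjects needed per covariate pattern
```

With a binary covariate the same requirement applies separately within each group, which is why the total doubles.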

Now consider a very simple single continuous predictor case in which $X$
has a normal distribution with mean zero and standard deviation $\sigma$, with the
true $\mathrm{Prob}\{Y = 1 | X = x\} = [1 + \exp(-x)]^{-1}$. The expected number of events
is $n/2$.^b The following simulation answers the question “What should $n$ be so
that the expected maximum absolute error (over $x \in [-1.5, 1.5]$) in $\hat{P}$ is less
than $\epsilon$?”


sigmas  <- c(.5, .75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4)
ns      <- seq(25, 300, by=25)
nsim    <- 1000
xs      <- seq(-1.5, 1.5, length=200)
pactual <- plogis(xs)

dn <- list(sigma=format(sigmas), n=format(ns))
maxerr <- N1 <- array(NA, c(length(sigmas), length(ns)), dn)
require(rms)

i <- 0
for(s in sigmas) {
  i <- i + 1
  j <- 0
  for(n in ns) {
    j <- j + 1
    n1 <- maxe <- 0
    for(k in 1:nsim) {
      x    <- rnorm(n, 0, s)
      P    <- plogis(x)
      y    <- ifelse(runif(n) < P, 1, 0)
      n1   <- n1 + sum(y)                     # running total of events
      beta <- lrm.fit(x, y)$coefficients
      phat <- plogis(beta[1] + beta[2] * xs)
      maxe <- maxe + max(abs(phat - pactual))
    }
    n1   <- n1/nsim                           # average number of events
    maxe <- maxe/nsim                         # average maximum absolute error
    maxerr[i,j] <- maxe
    N1[i,j]     <- n1
  }
}
xrange <- range(xs)
simerr <- llist(N1, maxerr, sigmas, ns, nsim, xrange)

maxe <- reShape(maxerr)
# Figure 10.4
xYplot(maxerr ~ n, groups=sigma, data=maxe,
       ylab=expression(paste('Average Maximum ',
                             abs(hat(P) - P))),
       type='l', lty=rep(1:2, 5), label.curve=FALSE,
       abline=list(h=c(.15, .1, .05), col=gray(.85)))
Key(.8, .68, other=list(cex=.7,
                        title=expression(sigma)))

a The general formula for the sample size required to achieve a margin of error of $\delta$ in
estimating a true probability of $\theta$ at the 0.95 confidence level is $n = (1.96/\delta)^2 \times \theta(1 - \theta)$.
Set $\theta = 1/2$ (intercept = 0) for the worst case.

b The R code can easily be modified for other event frequencies, or the minimum of
the number of events and non-events for a dataset at hand can be compared with $n/2$
in this simulation. An average maximum absolute error of 0.05 corresponds roughly
to a half-width of the 0.95 confidence interval of 0.1.

10.3  Test  Statistics 

The likelihood ratio, score, and Wald statistics discussed earlier can be used
to test any hypothesis in the logistic model. The likelihood ratio test is
generally preferred. When true parameters are near the null values all three
statistics usually agree. The Wald test has a significant drawback when the
true parameter value is very far from the null value. In such cases the
standard error estimate becomes too large. As $\hat{\beta}_j$ increases from 0, the Wald test
statistic for $H_0 : \beta_j = 0$ becomes larger, but after a certain point it becomes
smaller. The statistic will eventually drop to zero if $\hat{\beta}_j$ becomes infinite.278
Infinite estimates can occur in the logistic model especially when there is a
binary predictor whose mean is near 0 or 1. Wald statistics are especially
problematic in this case. For example, if 10 out of 20 males had a disease and
5 out of 5 females had the disease, the female : male odds ratio is infinite and
so is the logistic regression coefficient for sex. If such a situation occurs, the
likelihood ratio or score statistic should be used instead of the Wald statistic.
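The male/female example can be reproduced with a short base R sketch. It assumes `glm` rather than the `rms` functions used in this chapter, and it suppresses the warning that `glm` issues when fitted probabilities reach 0 or 1.

```r
# 10 of 20 males and 5 of 5 females have the disease
sex <- factor(rep(c("male", "female"), c(20, 5)),
              levels = c("male", "female"))
y   <- c(rep(1, 10), rep(0, 10), rep(1, 5))

# The maximum likelihood estimate for sex is infinite, so the iterations
# stop at a large coefficient with an enormous standard error
fit <- suppressWarnings(glm(y ~ sex, family = binomial))

wald_z <- coef(summary(fit))["sexfemale", "z value"]   # Wald statistic
lr     <- fit$null.deviance - fit$deviance             # likelihood ratio chi-square, 1 d.f.
p_lr   <- pchisq(lr, df = 1, lower.tail = FALSE)

c(wald_z = wald_z, lr = lr, p_lr = p_lr)
```

The Wald statistic collapses toward zero because its denominator grows without bound, while the likelihood ratio statistic remains finite and still reflects the association.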
Fig. 10.4 Simulated expected maximum error in estimating probabilities for
$x \in [-1.5, 1.5]$ with a single normally distributed $X$ with mean zero


For $k$-sample (ANOVA-type) logistic models, logistic model statistics are
equivalent to contingency table $\chi^2$ statistics. As exemplified in the logistic
model relating sex to response described previously, the global likelihood
ratio statistic for all dummy variables in a $k$-sample model is identical to the
contingency table ($k$-sample binomial) likelihood ratio $\chi^2$ statistic. The score
statistic for this same situation turns out to be identical to the $k - 1$ degrees
of freedom Pearson $\chi^2$ for a $k \times 2$ table.
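These equivalences can be checked numerically. The sketch below uses a made-up $3 \times 2$ table (the counts are hypothetical) and base R only; `anova(..., test = "Rao")` is used to obtain the score statistic from the two `glm` fits.

```r
# Hypothetical k = 3 sample problem: events out of 50 in each group
events <- c(12, 20, 31)
totals <- c(50, 50, 50)
group  <- factor(c("A", "B", "C"))
resp   <- cbind(events, totals - events)

fit0 <- glm(resp ~ 1,     family = binomial)   # no group effect
fit1 <- glm(resp ~ group, family = binomial)   # k - 1 dummy variables

lr_glm    <- fit0$deviance - fit1$deviance           # global LR chi-square
score_glm <- anova(fit0, fit1, test = "Rao")$Rao[2]  # global score chi-square

# Contingency table versions of the same statistics
tab   <- rbind(events, totals - events)
pear  <- chisq.test(tab, correct = FALSE)            # Pearson chi-square, k - 1 d.f.
g2    <- 2 * sum(tab * log(tab / pear$expected))     # likelihood ratio chi-square

c(lr_glm = lr_glm, g2 = g2,
  score_glm = score_glm, pearson = unname(pear$statistic))
```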

As  mentioned  in  Section  2.6,  it  can  be  dangerous  to  interpret  individual 
parameters,  make  pairwise  treatment  comparisons,  or  test  linearity  if  the 
overall  test  of  association  for  a  factor  represented  by  multiple  parameters  is 
insignificant. 


10.4  Residuals 


Several types of residuals can be computed for binary logistic model fits. Many
of these residuals are used to examine the influence of individual observations
on the fit. The partial residual can be used for directly assessing how each
predictor should be transformed. For the $i$th observation, the partial residual
for the $j$th element of $X$ is defined by

$$r_{ij} = \hat{\beta}_j X_{ij} + \frac{Y_i - \hat{P}_i}{\hat{P}_i (1 - \hat{P}_i)}, \qquad (10.27)$$

where $X_{ij}$ is the value of the $j$th variable in the $i$th observation, $Y_i$ is the
corresponding value of the response, and $\hat{P}_i$ is the predicted probability that
$Y_i = 1$. A smooth plot (using, e.g., loess) of $X_{ij}$ against $r_{ij}$ will provide an
estimate of how $X_j$ should be transformed, adjusting for the other $X$s (using
their current transformations). Typically one tentatively models $X_j$ linearly
and checks the smoothed plot for linearity. A U-shaped relationship in this
plot, for example, indicates that a squared term or spline function needs to
be added for $X_j$. This approach does assume additivity of predictors.
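As a rough illustration of Equation 10.27, the following base R sketch computes partial residuals by hand for a simulated dataset in which the true effect of `x1` is U-shaped but `x1` is tentatively fitted linearly. The data and variable names are assumptions made for the example.

```r
set.seed(2)
n  <- 500
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + x1^2 + 0.5 * x2))   # true x1 effect is quadratic

fit  <- glm(y ~ x1 + x2, family = binomial)         # x1 tentatively linear
phat <- fitted(fit)                                 # predicted probabilities

# Partial residual for x1: beta-hat_j * X_ij + (Y_i - P_i) / [P_i (1 - P_i)]
r1 <- coef(fit)["x1"] * x1 + (y - phat) / (phat * (1 - phat))

# Smooth the partial residuals against x1 to estimate the needed transformation
sm  <- loess(r1 ~ x1)
est <- predict(sm, data.frame(x1 = c(-1.5, 0, 1.5)))
est
# plot(x1, r1); lines(sort(x1), fitted(sm)[order(x1)])   # optional display
```

The smoothed values are higher at both ends than in the middle, which is the U shape that signals the need for a squared or spline term in `x1`.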


10.5  Assessment  of  Model  Fit 

As  the  logistic  regression  model  makes  no  distributional  assumptions,  only 
the  assumptions  of  linearity  and  additivity  need  to  be  verified  (in  addition 
to  the  usual  assumptions  about  independence  of  observations  and  inclusion 
of  important  covariables).  In  ordinary  linear  regression  there  is  no  global 
test  for  lack  of  model  fit  unless  there  are  replicate  observations  at  various 
settings  of  X.  This  is  because  ordinary  regression  entails  estimation  of  a 
separate variance parameter $\sigma^2$. In logistic regression there are global tests
for goodness of fit. Unfortunately, some of the most frequently used ones are
inappropriate. For example, it is common to see a deviance test of goodness
of fit based on the “residual” log likelihood, with P-values obtained from a $\chi^2$
distribution with $n - p$ d.f. This P-value is inappropriate since the deviance
does not have an asymptotic $\chi^2$ distribution, due to the facts that the number
of parameters estimated is increasing at the same rate as $n$ and the expected
cell frequencies are far below five (by definition).

Hosmer and Lemeshow30 1 have developed a commonly used test for goodness
of fit for binary logistic models based on grouping into deciles of predicted
probability and performing an ordinary $\chi^2$ test for the mean predicted
probability against the observed fraction of events (using 8 d.f. to account
for evaluating fit on the model development sample). The Hosmer-Lemeshow
test is dependent on the choice of how predictions are grouped303 and it is
not clear that the choice of the number of groups should be independent of $n$.
Hosmer et al.30 have compared a number of global goodness of fit tests for
binary logistic regression. They concluded that the simple unweighted sum of
squares test of Copas124 as modified by le Cessie and van Houwelingen is as
good as any. They used a normal Z-test for the sum of squared errors ($n \times B$,
where $B$ is the Brier index in Equation 10.35). This test takes into account the
fact that one cannot obtain a $\chi^2$ distribution for the sum of squares. It also
takes into account the estimation of $\beta$. It is not yet clear for which types of
lack of fit this test has reasonable power. Returning to the external validation
case where uncertainty of $\beta$ does not need to be accounted for, Stallard58 ' has
further documented the lack of power of the original Hosmer-Lemeshow test
and found more power with a logarithmic scoring rule (deviance test) and a
$\chi^2$ test that, unlike the simple unweighted sum of squares test, weights each
squared error by dividing it by $\hat{P}_i(1 - \hat{P}_i)$. A scaled $\chi^2$ distribution seemed to
provide the best approximation to the null distribution of the test statistics.
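To make the grouping dependence concrete, here is a minimal base R sketch of a Hosmer-Lemeshow-type statistic. The function `hosmer_lemeshow` is a hypothetical helper written for this illustration, not a function from `rms`, and the data are simulated.

```r
# Group by quantiles of predicted probability and compare observed with
# expected numbers of events; g - 2 d.f. on the development sample
hosmer_lemeshow <- function(y, phat, g = 10) {
  brks <- quantile(phat, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(phat, breaks = brks, include.lowest = TRUE)
  obs  <- tapply(y,    grp, sum)       # observed events per group
  expd <- tapply(phat, grp, sum)       # expected events per group
  n_g  <- tapply(y,    grp, length)    # group sizes
  pbar <- expd / n_g                   # mean predicted probability
  stat <- sum((obs - expd)^2 / (n_g * pbar * (1 - pbar)))
  df   <- g - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(3)
x   <- rnorm(400)
y   <- rbinom(400, 1, plogis(0.3 + x))
fit <- glm(y ~ x, family = binomial)

hl10 <- hosmer_lemeshow(y, fitted(fit), g = 10)   # deciles, 8 d.f.
hl8  <- hosmer_lemeshow(y, fitted(fit), g = 8)    # a different grouping
c(deciles = hl10$p.value, octiles = hl8$p.value)
```

Running the test with two different numbers of groups on the same fit generally gives different P-values, which is the arbitrariness noted above.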

More power for detecting lack of fit is expected to be obtained from testing
specific alternatives to the model. In the model

$$\mathrm{logit}\{Y = 1 | X\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2, \qquad (10.28)$$

where $X_1$ is binary and $X_2$ is continuous, one needs to verify that the log
odds is related to $X_1$ and $X_2$ according to Figure 10.5.


Fig. 10.5 Logistic regression assumptions for one binary and one continuous predictor


The simplest method for validating that the data are consistent with the
no-interaction linear model involves stratifying the sample by $X_1$ and quantile
groups (e.g., deciles) of $X_2$.265 Within each stratum the proportion of
responses $\hat{P}$ is computed and the log odds calculated from $\log[\hat{P}/(1 - \hat{P})]$.
The number of quantile groups should be such that there are at least 20 (and
perhaps many more) subjects in each $X_1 \times X_2$ group. Otherwise, probabilities
cannot be estimated precisely enough to allow trends to be seen above “noise”
in the data. Since at least 3 $X_2$ groups must be formed to allow assessment
of linearity, the total sample size must be at least $2 \times 3 \times 20 = 120$ for this
method to work at all.
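The stratification just described can be sketched in base R as follows. The data are simulated and the names (`x1`, `x2`, `x2g`) are placeholders; `cut` with quantile breakpoints stands in for the `cut2` function from `Hmisc` used in this chapter's code.

```r
set.seed(4)
n  <- 2000
x1 <- rbinom(n, 1, 0.5)                          # binary predictor
x2 <- rnorm(n, 50, 10)                           # continuous predictor
y  <- rbinom(n, 1, plogis(-4 + 0.8 * x1 + 0.07 * x2))

g   <- 5                                         # number of quantile groups of x2
x2g <- cut(x2, quantile(x2, seq(0, 1, length.out = g + 1)),
           include.lowest = TRUE)

cell_n <- table(x2g, x1)                         # check: at least 20 per cell
prop   <- tapply(y, list(x2g, x1), mean)         # proportion of responses per stratum
logit  <- qlogis(prop)                           # log[P / (1 - P)]
x2mean <- tapply(x2, x2g, mean)                  # plotting position for each group

# matplot(x2mean, logit, type = "b")             # optional: one line per x1 level
round(logit, 2)
```

Roughly parallel straight lines in the optional plot would be consistent with the linear no-interaction model of Equation 10.28.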
Figure 10.6 demonstrates this method for a large sample size of 3504 subjects
stratified by sex and deciles of age. Linearity is apparent for males while
there is evidence for slight interaction between age and sex since the age trend
for females appears curved.

getHdata(acath)
acath$sex <- factor(acath$sex, 0:1, c('male', 'female'))
dd <- datadist(acath); options(datadist='dd')
f <- lrm(sigdz ~ rcs(age, 4) * sex, data=acath)

w <- function(...)
  with(acath, {
    # nonparametric smooth on the logit scale, by sex
    plsmo(age, sigdz, group=sex, fun=qlogis, lty='dotted',
          add=TRUE, grid=TRUE)
    # logit of the proportion with disease by sex and decile of age
    af   <- cut2(age, g=10, levels.mean=TRUE)
    prop <- qlogis(tapply(sigdz, list(af, sex), mean,
                          na.rm=TRUE))
    agem <- as.numeric(row.names(prop))
    lpoints(agem, prop[, 'female'], pch=4, col='green')
    lpoints(agem, prop[, 'male'],   pch=2, col='green')
  } )   # Figure 10.6

plot(Predict(f, age, sex), ylim=c(-2, 4), addpanel=w,
     label.curve=list(offset=unit(0.5, 'em')))

The subgrouping method requires relatively large sample sizes and does
not use continuous factors effectively. The ordering of values is not used at all
between intervals, and the estimate of the relationship for a continuous variable
has little resolution. Also, the method of grouping chosen (e.g., deciles
vs. quintiles vs. rounding) can alter the shape of the plot.

In this dataset with only two variables, it is efficient to use a nonparametric
smoother for age, separately for males and females. Nonparametric
smoothers, such as loess used here, work well for binary response variables
(see Section 2.4.7); the logit transformation is made on the smoothed
probability estimates. The smoothed estimates are shown in Figure 10.6.

When there are several predictors, the restricted cubic spline function is
better for estimating the true relationship between $X_2$ and $\mathrm{logit}\{Y = 1\}$ for
continuous variables without assuming linearity. By fitting a model containing
$X_2$ expanded into $k - 1$ terms, where $k$ is the number of knots, one can obtain
an estimate of the transformation of $X_2$ as discussed in Section 2.4:

$$\mathrm{logit}\{Y = 1 | X\} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_2' + \hat{\beta}_4 X_2''$$
$$= \hat{\beta}_0 + \hat{\beta}_1 X_1 + f(X_2), \qquad (10.29)$$

where $X_2'$ and $X_2''$ are constructed spline variables (when $k = 4$). Plotting
the estimated spline function $f(X_2)$ versus $X_2$ will estimate how the effect of
$X_2$ should be modeled. If the sample is sufficiently large, the spline function
can be fitted separately for $X_1 = 0$ and $X_1 = 1$, allowing detection of even
unusual interaction patterns. A formal test of linearity in $X_2$ is obtained by
testing $H_0 : \beta_3 = \beta_4 = 0$.
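A test of linearity of this kind can be sketched with base R and the `splines` package. Note the substitution: `splines::ns` (a natural cubic spline basis) is used here as a stand-in for `rms::rcs`; with `df = 3` it has four knots counting the two boundary knots, although its default knot placement differs from that of `rcs`. The data are simulated.

```r
library(splines)

set.seed(5)
n  <- 1000
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n, 0, 10)
y  <- rbinom(n, 1, plogis(-2 + 0.5 * x1 + 1.5 * log(x2 + 1)))  # nonlinear in x2

fit_lin <- glm(y ~ x1 + x2,             family = binomial)  # x2 linear
fit_spl <- glm(y ~ x1 + ns(x2, df = 3), family = binomial)  # x2 as k - 1 = 3 terms

# Linear functions are contained in the natural spline space, so the models
# are nested and the 2 d.f. LR test is a test of linearity in x2
tst <- anova(fit_lin, fit_spl, test = "Chisq")
tst
```

The two extra degrees of freedom correspond to the nonlinear terms $X_2'$ and $X_2''$ in Equation 10.29.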
Fig. 10.6 Logit proportions of significant coronary artery disease by sex and deciles
of age for n=3504 patients, with spline fits (smooth curves). Spline fits are for k = 4
knots at age = 36, 48, 56, and 68 years, and interaction between age and sex is allowed.
Shaded bands are pointwise 0.95 confidence limits for predicted log odds. Smooth
nonparametric estimates are shown as dotted curves. Data courtesy of the Duke
Cardiovascular Disease Databank.


For testing interaction between $X_1$ and $X_2$, a product term (e.g., $X_1 X_2$)
can be added to the model and its coefficient tested. A more general
simultaneous test of linearity and lack of interaction for a two-variable model in
which one variable is binary (or is assumed linear) is obtained by fitting the
model

$$\mathrm{logit}\{Y = 1 | X\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2' + \beta_4 X_2''$$
$$+ \beta_5 X_1 X_2 + \beta_6 X_1 X_2' + \beta_7 X_1 X_2'' \qquad (10.30)$$

and testing $H_0 : \beta_3 = \ldots = \beta_7 = 0$. This formulation allows the shape of the
$X_2$ effect to be completely different for each level of $X_1$. There is virtually
no departure from linearity and additivity that cannot be detected from this
expanded model formulation. The most computationally efficient test for lack
of fit is the score test (e.g., $X_1$ and $X_2$ are forced into a tentative model
and the remaining variables are candidates). Figure 10.6 also depicts a fitted
spline logistic model with $k = 4$, allowing for general interaction between
age and sex as parameterized above. The fitted function, after expanding the
restricted cubic spline function for simplicity (see Equation 2.27), is given
above. Note the good agreement between the empirical estimates of log odds
and the spline fits and nonparametric estimates in this large dataset.
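The simultaneous test based on Equation 10.30 can be sketched as a comparison of the linear additive model with the full spline-by-group model. As before, this is a hedged illustration: the data are simulated, and `splines::ns` with `df = 3` is substituted for a four-knot `rcs` term. The score statistic is requested through `anova(..., test = "Rao")`.

```r
library(splines)

set.seed(6)
n  <- 1500
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n, 0, 10)
# x2 is linear when x1 = 0 and curved when x1 = 1
y  <- rbinom(n, 1, plogis(-1 + 0.4 * x1 + 0.2 * x2 +
                          0.15 * x1 * (x2 - 5)^2 / 5))

fit_a <- glm(y ~ x1 + x2,             family = binomial)  # linear, additive
fit_d <- glm(y ~ x1 * ns(x2, df = 3), family = binomial)  # separate shape per x1 level

# 5 d.f. test of H0: all nonlinear and interaction terms are zero
tst <- anova(fit_a, fit_d, test = "Rao")
tst
```

The five degrees of freedom are the two nonlinear main-effect terms plus the three interaction terms, matching $\beta_3, \ldots, \beta_7$.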

An analysis of log likelihood for this model and various sub-models is found
in Table 10.3. The $\chi^2$ for global tests is corrected for the intercept and the
degrees of freedom does not include the intercept.

Table 10.3 LR $\chi^2$ tests for coronary artery disease risk

Model / Hypothesis                            Likelihood    d.f.   P      Formula
                                              Ratio $\chi^2$
a: sex, age (linear, no interaction)            766.0        2
b: sex, age, age x sex                          768.2        3
c: sex, spline in age                           769.4        4
d: sex, spline in age, interaction              782.5        7
H0: no age x sex interaction                      2.2        1     .14    (b - a)
    given linearity
H0: age linear | no interaction                   3.4        2     .18    (c - a)
H0: age linear, no interaction                   16.6        5     .005   (d - a)
H0: age linear, product form                     14.4        4     .006   (d - b)
    interaction
H0: no interaction, allowing for                 13.1        3     .004   (d - c)
    nonlinearity in age

Table 10.4 AIC on $\chi^2$ scale by number of knots

k    Model $\chi^2$    AIC
0       99.23        97.23
3      112.69       108.69
4      121.30       115.30
5      123.51       115.51
6      124.41       114.51

This analysis confirms the first impression from the graph, namely, that
age x sex interaction is present but it is not of the form of a simple product
between age and sex (change in slope). In the context of a linear age effect,
there is no significant product interaction effect (P = .14). Without allowing
for interaction, there is no significant nonlinear effect of age (P = .18). However,
the general test of lack of fit with 5 d.f. indicates a significant departure
from the linear additive model (P = .005).
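The P-values in Table 10.3 follow directly from the tabulated $\chi^2$ differences and their degrees of freedom. A small sketch of that arithmetic (the data frame `tests` is just a transcription of the table's hypothesis rows):

```r
# Chi-square statistics and d.f. for the five hypotheses in Table 10.3
tests <- data.frame(
  formula = c("b - a", "c - a", "d - a", "d - b", "d - c"),
  chisq   = c(2.2, 3.4, 16.6, 14.4, 13.1),
  df      = c(1, 2, 5, 4, 3)
)

# Upper-tail chi-square probabilities
tests$p <- pchisq(tests$chisq, tests$df, lower.tail = FALSE)
print(tests, digits = 2)
```

Each statistic is the difference in model likelihood ratio $\chi^2$ between the two models named in the Formula column (as printed in the table, to one decimal place), and each d.f. is the difference in the number of parameters.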

In Figure 10.7, data from 2332 patients who underwent cardiac catheterization
at Duke University Medical Center and were found to have significant
(> 75%) diameter narrowing of at least one major coronary artery were
analyzed (the dataset is available from the Web site). The relationship between
the time from the onset of symptoms of coronary artery disease (e.g., angina,
myocardial infarction) to the probability that the patient has severe (three-vessel
disease or left main disease, tvdlm) coronary disease was of interest.
There were 1129 patients with tvdlm. A logistic model was used with the
duration of symptoms appearing as a restricted cubic spline function with
k = 3, 4, 5, and 6 equally spaced knots in terms of quantiles between .05 and
.95. The best fit for the number of parameters was chosen using Akaike’s
information criterion (AIC), computed in Table 10.4 as the model likelihood
ratio $\chi^2$ minus twice the number of parameters in the model aside from the
intercept. The linear model is denoted k = 0.
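The AIC column of Table 10.4 can be recomputed from the model $\chi^2$ values with the definition just given. In this sketch the number of non-intercept parameters is taken to be 1 for the linear model and $k - 1$ for a $k$-knot restricted cubic spline; the data frame simply transcribes the table.

```r
tab <- data.frame(k     = c(0, 3, 4, 5, 6),
                  chisq = c(99.23, 112.69, 121.30, 123.51, 124.41))

# Parameters aside from the intercept: 1 if linear, k - 1 for a k-knot spline
tab$params <- ifelse(tab$k == 0, 1, tab$k - 1)

# AIC on the chi-square scale: model LR chi-square minus twice the parameters
tab$aic <- tab$chisq - 2 * tab$params

best <- tab$k[which.max(tab$aic)]   # larger is better on this scale
tab
best
```

The recomputed values agree with the table for k = 0 through 5 and select k = 5. For k = 6 the arithmetic gives 114.41 whereas the table prints 114.51, so one of the two printed entries in that row may contain a small typographical slip; the choice of k = 5 is unaffected either way.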

dz <- subset(acath, sigdz==1)
dd <- datadist(dz)

f <- lrm(tvdlm ~ rcs(cad.dur, 5), data=dz)

w <- function(...)
  with(dz, {
    # nonparametric smooth on the logit scale
    plsmo(cad.dur, tvdlm, fun=qlogis, add=TRUE,
          grid=TRUE, lty='dotted')
    # subgroup estimates versus mean duration within each group
    x    <- cut2(cad.dur, g=15, levels.mean=TRUE)
    prop <- qlogis(tapply(tvdlm, x, mean, na.rm=TRUE))
    xm   <- as.numeric(names(prop))
    lpoints(xm, prop, pch=2, col='green')
  } )   # Figure 10.7

plot(Predict(f, cad.dur), addpanel=w)


Fig. 10.7 Estimated relationship between duration of symptoms and the log odds
of severe coronary artery disease for k = 5. Knots are marked with arrows. Solid line
is spline fit; dotted line is a nonparametric loess estimate. (Horizontal axis:
Duration of Symptoms of Coronary Artery Disease.)


Figure 10.7 displays the spline fit for k = 5. The triangles represent subgroup
estimates obtained by dividing the sample into groups of 150 patients.
For example, the leftmost triangle represents the logit of the proportion
of tvdlm in the 150 patients with the shortest duration of symptoms, versus
the mean duration in that group. A Wald test of linearity
showed highly significant nonlinearity ($\chi^2 = 23.92$ with 3 d.f.). The plot of the
spline transformation suggests a log transformation, and when log (duration
of symptoms in months + 1) was fitted in a logistic model, the model likelihood
ratio $\chi^2$ (119.33 with 1 d.f.) was virtually as good as that of the spline model
(123.51 with 4 d.f.); the corresponding Akaike information criteria (on the $\chi^2$
scale) are 117.33 and 115.51. To check for adequacy in the log transformation,
a five-knot restricted cubic spline function was fitted to $\log_{10}(\mathrm{months} + 1)$,
as displayed in Figure 10.8. There is some evidence for lack of fit on the right,
but the Wald $\chi^2$ for testing linearity yields P = .27.

f <- lrm(tvdlm ~ log10(cad.dur + 1), data=dz)
w <- function(...)
  with(dz, {
    # proportions in groups of about 150 patients
    x    <- cut2(cad.dur, m=150, levels.mean=TRUE)
    prop <- tapply(tvdlm, x, mean, na.rm=TRUE)
    xm   <- as.numeric(names(prop))
    lpoints(xm, prop, pch=2, col='green')
  } )
# Figure 10.8
plot(Predict(f, cad.dur, fun=plogis), ylab='P',
     ylim=c(.2, .8), addpanel=w)


Fig. 10.8 Fitted linear logistic model in $\log_{10}(\mathrm{duration} + 1)$, with subgroup
estimates using groups of 150 patients. Fitted equation is logit(tvdlm) = -.9809 +
.7122 $\log_{10}(\mathrm{months} + 1)$. (Horizontal axis: Duration of Symptoms of Coronary
Artery Disease.)


If the model contains two continuous predictors, they may both be expanded
with spline functions in order to test linearity or to describe nonlinear
relationships. Testing interaction is more difficult here. If $X_1$ is continuous,
one might temporarily group $X_1$ into quantile groups. Consider the subset
of 2258 (1490 with disease) of the 3504 patients used in Figure 10.6 who
have serum cholesterol measured. A logistic model for predicting significant
coronary disease was fitted with age in tertiles (modeled with two dummy
variables), sex, age x sex interaction, four-knot restricted cubic spline in
cholesterol, and age tertile x cholesterol interaction. Except for the sex
adjustment this model is equivalent to fitting three separate spline functions in
cholesterol, one for each age tertile. The fitted model is shown in Figure 10.9
for cholesterol and age tertile against logit of significant disease. Significant
age x cholesterol interaction is apparent from the figure and is suggested by
the Wald $\chi^2$ statistic (10.03) that follows. Note that the test for linearity of
the interaction with respect to cholesterol is very insignificant ($\chi^2 = 2.40$ on
4 d.f.), but we retain it for now. The fitted function is

Logistic Regression Model

lrm(formula = sigdz ~ age.tertile * (sex + rcs(cholesterol, 4)),
    data = acath)

Frequencies of Missing Values Due to Each Variable

      sigdz age.tertile         sex cholesterol
          0           0           0        1246

                                   Model Likelihood      Discrimination     Rank Discrim.
                                      Ratio Test            Indexes            Indexes
Obs                       2258    LR $\chi^2$    533.52    $R^2$    0.291    $C$        0.780
 0                         768    d.f.            14    $g$      1.316    $D_{xy}$   0.560
 1                        1490    Pr(> $\chi^2$) < 0.0001  $g_r$    3.729    $\gamma$   0.562
max $|\partial \log L / \partial \beta|$  $2 \times 10^{-8}$                       $g_p$    0.252    $\tau_a$   0.251
                                                        Brier  0.173

                                         Coef     S.E.    Wald Z   Pr(>|Z|)
Intercept                              -0.4155   1.0987   -0.38     0.7053
age.tertile=[49,58)                     0.8781   1.7337    0.51     0.6125
age.tertile=[58,82]                     4.7861   1.8143    2.64     0.0083
sex=female                             -1.6123   0.1751   -9.21   < 0.0001
cholesterol                             0.0029   0.0060    0.48     0.6347
cholesterol'                            0.0384   0.0242    1.59     0.1126
cholesterol''                          -0.1148   0.0768   -1.49     0.1350
age.tertile=[49,58) * sex=female       -0.7900   0.2537   -3.11     0.0018
age.tertile=[58,82] * sex=female       -0.4530   0.2978   -1.52     0.1283
age.tertile=[49,58) * cholesterol       0.0011   0.0095    0.11     0.9093
age.tertile=[58,82] * cholesterol      -0.0158   0.0099   -1.59     0.1111
age.tertile=[49,58) * cholesterol'     -0.0183   0.0365   -0.50     0.6162
age.tertile=[58,82] * cholesterol'      0.0127   0.0406    0.31     0.7550
age.tertile=[49,58) * cholesterol''     0.0582   0.1140    0.51     0.6095
age.tertile=[58,82] * cholesterol''    -0.0092   0.1301   -0.07     0.9436

ltx(f)

$X\hat{\beta} = -0.415 + 0.878[\text{age.tertile} \in [49,58)] + 4.79[\text{age.tertile} \in [58,82]]
- 1.61[\text{female}] + 0.00287\,\text{cholesterol}
+ 1.52 \times 10^{-6}(\text{cholesterol} - 160)_+^3 - 4.53 \times 10^{-6}(\text{cholesterol} - 208)_+^3
+ 3.44 \times 10^{-6}(\text{cholesterol} - 243)_+^3 - 4.28 \times 10^{-7}(\text{cholesterol} - 319)_+^3
+ [\text{female}]\,[-0.79[\text{age.tertile} \in [49,58)] - 0.453[\text{age.tertile} \in [58,82]]]
+ [\text{age.tertile} \in [49,58)]\,[0.00108\,\text{cholesterol} - 7.23 \times 10^{-7}(\text{cholesterol} - 160)_+^3
+ 2.3 \times 10^{-6}(\text{cholesterol} - 208)_+^3 - 1.84 \times 10^{-6}(\text{cholesterol} - 243)_+^3
+ 2.69 \times 10^{-7}(\text{cholesterol} - 319)_+^3]
+ [\text{age.tertile} \in [58,82]]\,[-0.0158\,\text{cholesterol} + 5 \times 10^{-7}(\text{cholesterol} - 160)_+^3
- 3.64 \times 10^{-7}(\text{cholesterol} - 208)_+^3 - 5.15 \times 10^{-7}(\text{cholesterol} - 243)_+^3
+ 3.78 \times 10^{-7}(\text{cholesterol} - 319)_+^3]$.


# Table 10.5:
latex(anova(f), file='', size='smaller',
      caption='Crudely categorizing age into tertiles',
      label='tab:anova-tertiles')
yl <- c(-1, 5)
plot(Predict(f, cholesterol, age.tertile),
     adj.subtitle=FALSE, ylim=yl)   # Figure 10.9


Table 10.5 Crudely categorizing age into tertiles

                                                           $\chi^2$    d.f.   P
age.tertile (Factor+Higher Order Factors)                  120.74    10   < 0.0001
  All Interactions                                          21.87     8     0.0052
sex (Factor+Higher Order Factors)                          329.54     3   < 0.0001
  All Interactions                                           9.78     2     0.0075
cholesterol (Factor+Higher Order Factors)                   93.75     9   < 0.0001
  All Interactions                                          10.03     6     0.1235
  Nonlinear (Factor+Higher Order Factors)                    9.96     6     0.1263
age.tertile x sex (Factor+Higher Order Factors)              9.78     2     0.0075
age.tertile x cholesterol (Factor+Higher Order Factors)     10.03     6     0.1235
  Nonlinear                                                  2.62     4     0.6237
  Nonlinear Interaction : f(A,B) vs. AB                      2.62     4     0.6237
TOTAL NONLINEAR                                              9.96     6     0.1263
TOTAL INTERACTION                                           21.87     8     0.0052
TOTAL NONLINEAR + INTERACTION                               29.67    10     0.0010
TOTAL                                                      410.75    14   < 0.0001

Fig. 10.9 Log odds of significant coronary artery disease modeling age with two
dummy variables


Before fitting a parametric model that allows interaction between age and
cholesterol, let us use the local regression model of Cleveland et al.96 discussed
in Section 2.4.7. This nonparametric smoothing method is not meant
to handle binary Y, but it can still provide useful graphical displays in the
binary case. Figure 10.10 depicts the fit from a local regression model predicting
Y = 1 = significant coronary artery disease. Predictors are sex (modeled
parametrically with a dummy variable), age, and cholesterol, the last two
fitted nonparametrically. The effect of not explicitly modeling a probability
is seen in the figure, as the predicted probabilities exceeded 1. Because of this
we do not take the logit transformation but leave the predicted values in raw
form. However, the overall shape is in agreement with Figure 10.10.


# Re-do model with continuous age
f <- loess(sigdz ~ age * (sx + cholesterol), data=acath,
           parametric="sx", drop.square="sx")
ages  <- seq(25, 75, length=40)
chols <- seq(100, 400, length=40)
g <- expand.grid(cholesterol=chols, age=ages, sx=0)
# drop sex dimension of grid since held to 1 value
p <- drop(predict(f, g))
p[p < 0.001] <- 0.001
p[p > 0.999] <- 0.999
zl <- c(-3, 6)   # Figure 10.10
wireframe(qlogis(p) ~ cholesterol * age,
          xlab=list(rot=30), ylab=list(rot=-40),
          zlab=list(label='log odds', rot=90), zlim=zl,
          scales=list(arrows=FALSE), data=g)


Chapter 2 discussed linear splines, which can be used to construct linear
spline surfaces by adding all cross-products of the linear variables and spline
terms in the model. With a sufficient number of knots for each predictor, the
linear spline surface can fit a wide variety of patterns. However, it requires
a large number of parameters to be estimated. For the age-sex-cholesterol
example, a linear spline surface is fitted for age and cholesterol, and a sex
x age spline interaction is also allowed. Figure 10.11 shows a fit that placed
knots at quartiles of the two continuous variables.^c The algebraic form of the
fitted model is shown below.

f <- lrm(sigdz ~ lsp(age, c(46, 52, 59)) *
         (sex + lsp(cholesterol, c(196, 224, 259))),
         data=acath)
ltx(f)

$X\hat{\beta} = -1.83 + 0.0232\,\text{age} + 0.0759(\text{age} - 46)_+ - 0.0025(\text{age} - 52)_+
+ 2.27(\text{age} - 59)_+ + 3.02[\text{female}] - 0.0177\,\text{cholesterol}
+ 0.114(\text{cholesterol} - 196)_+ - 0.131(\text{cholesterol} - 224)_+ + 0.0651(\text{cholesterol} - 259)_+
+ [\text{female}]\,[-0.112\,\text{age} + 0.0852(\text{age} - 46)_+ - 0.0302(\text{age} - 52)_+ + 0.176(\text{age} - 59)_+]
+ \text{age}\,[0.000577\,\text{cholesterol} - 0.00286(\text{cholesterol} - 196)_+
+ 0.00382(\text{cholesterol} - 224)_+ - 0.00205(\text{cholesterol} - 259)_+]
+ (\text{age} - 46)_+\,[-0.000936\,\text{cholesterol} + 0.00643(\text{cholesterol} - 196)_+
- 0.0115(\text{cholesterol} - 224)_+ + 0.00756(\text{cholesterol} - 259)_+]
+ (\text{age} - 52)_+\,[0.000433\,\text{cholesterol} - 0.0037(\text{cholesterol} - 196)_+
+ 0.00815(\text{cholesterol} - 224)_+ - 0.00715(\text{cholesterol} - 259)_+]
+ (\text{age} - 59)_+\,[-0.0124\,\text{cholesterol} + 0.015(\text{cholesterol} - 196)_+
- 0.0067(\text{cholesterol} - 224)_+ + 0.00752(\text{cholesterol} - 259)_+]$.


Fig.  10.10  Local  regression  fit  for  the  logit  of  the  probability  of  significant  coronary 
disease  vs.  age  and  cholesterol  for  males,  based  on  the  loess  function. 


c In the wireframe plots that follow, predictions for cholesterol-age combinations for
which fewer than 5 exterior points exist are not shown, so as to not extrapolate to
regions not supported by at least five points beyond the data perimeter.

latex(anova(f), caption='Linear spline surface', file='',
      size='smaller', label='tab:anova-lsp')   # Table 10.6

perim <- with(acath,
              perimeter(cholesterol, age, xinc=20, n=5))
zl <- c(-2, 4)   # Figure 10.11
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)


Table 10.6 Linear spline surface

                                                      χ²      d.f.  P
age (Factor+Higher Order Factors)                     164.17  24    < 0.0001
  All Interactions                                     42.28  20    0.0025
  Nonlinear (Factor+Higher Order Factors)              25.21  18    0.1192
sex (Factor+Higher Order Factors)                     343.80   5    < 0.0001
  All Interactions                                     23.90   4    0.0001
cholesterol (Factor+Higher Order Factors)             100.13  20    < 0.0001
  All Interactions                                     16.27  16    0.4341
  Nonlinear (Factor+Higher Order Factors)              16.35  15    0.3595
age × sex (Factor+Higher Order Factors)                23.90   4    0.0001
  Nonlinear                                            12.97   3    0.0047
  Nonlinear Interaction : f(A,B) vs. AB                12.97   3    0.0047
age × cholesterol (Factor+Higher Order Factors)        16.27  16    0.4341
  Nonlinear                                            11.45  15    0.7204
  Nonlinear Interaction : f(A,B) vs. AB                11.45  15    0.7204
  f(A,B) vs. Af(B) + Bg(A)                              9.38   9    0.4033
  Nonlinear Interaction in age vs. Af(B)                9.99  12    0.6167
  Nonlinear Interaction in cholesterol vs. Bg(A)       10.75  12    0.5503
TOTAL NONLINEAR                                        33.22  24    0.0995
TOTAL INTERACTION                                      42.28  20    0.0025
TOTAL NONLINEAR + INTERACTION                          49.03  26    0.0041
TOTAL                                                 449.26  29    < 0.0001

Chapter 2 also discussed a tensor spline extension of the restricted cubic
spline model to fit a smooth function of two predictors, f(X1, X2). Since
this function allows for general interaction between X1 and X2, the two-
variable cubic spline is a powerful tool for displaying and testing interaction,
assuming the sample size warrants estimating 2(k − 1) + (k − 1)² parameters
for a rectangular grid of k × k knots. Unlike the linear spline surface, the
cubic surface is smooth. It also requires fewer parameters in most situations.
The general cubic model with k = 4 (ignoring the sex effect here) is

\[
\begin{aligned}
& \beta_0 + \beta_1 X_1 + \beta_2 X_1' + \beta_3 X_1'' + \beta_4 X_2 + \beta_5 X_2' + \beta_6 X_2'' + \beta_7 X_1 X_2 \\
& + \beta_8 X_1 X_2' + \beta_9 X_1 X_2'' + \beta_{10} X_1' X_2 + \beta_{11} X_1' X_2' \\
& + \beta_{12} X_1' X_2'' + \beta_{13} X_1'' X_2 + \beta_{14} X_1'' X_2' + \beta_{15} X_1'' X_2'',
\end{aligned}
\tag{10.31}
\]

where X1′, X1″, X2′, and X2″ are restricted cubic spline component variables
for X1 and X2 for k = 4. A general test of interaction with 9 d.f. is
H0 : β7 = ... = β15 = 0. A test of adequacy of a simple product form interaction is
H0 : β8 = ... = β15 = 0 with 8 d.f. A 13 d.f. test of linearity and additivity
is H0 : β2 = β3 = β5 = β6 = β7 = β8 = β9 = β10 = β11 = β12 = β13 = β14 =
β15 = 0.
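The degrees of freedom quoted above follow from simple counting, which the short R sketch below makes explicit. The function name `surface_df` is hypothetical, and the extension of the 8 and 13 d.f. counts to a general k is an extrapolation of the same counting argument, not a formula stated in the text.

```r
# Hypothetical helper: parameter counts (excluding the intercept) for a
# restricted cubic spline surface with k knots per predictor.
surface_df <- function(k) {
  main        <- 2 * (k - 1)   # k - 1 terms for each of X1 and X2
  interaction <- (k - 1)^2     # all cross-products of the two sets of terms
  c(main                 = main,
    interaction          = interaction,
    total                = main + interaction,
    product_adequacy     = interaction - 1,            # all but X1 * X2
    linearity_additivity = 2 * (k - 2) + interaction)  # nonlinear + interaction
}

surface_df(4)
```

For k = 4 this gives 15 parameters in total, 9 for interaction, 8 for the test of a simple product form, and 13 for the joint test of linearity and additivity, in agreement with the counts in the text.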

Figure  10.12  depicts  the  fit  of  this  model.  There  is  excellent  agreement  with 
Figures  10.9  and  10.11,  including  an  increased  (but  probably  insignificant) 
risk  with  low  cholesterol  for  age  >  57. 

f <- lrm(sigdz ~ rcs(age,4)*(sex + rcs(cholesterol,4)),
         data=acath, tol=1e-11)
ltx(f)

\[
\begin{aligned}
X\hat{\beta} ={} & -6.41 + 0.166\,\mathrm{age} - 0.00067(\mathrm{age}-36)_{+}^{3} + 0.00543(\mathrm{age}-48)_{+}^{3} - 0.00727(\mathrm{age}-56)_{+}^{3} + 0.00251(\mathrm{age}-68)_{+}^{3} \\
& + 2.87[\mathrm{female}] + 0.00979\,\mathrm{cholesterol} + 1.96\times10^{-6}(\mathrm{cholesterol}-160)_{+}^{3} - 7.16\times10^{-6}(\mathrm{cholesterol}-208)_{+}^{3} \\
& + 6.35\times10^{-6}(\mathrm{cholesterol}-243)_{+}^{3} - 1.16\times10^{-6}(\mathrm{cholesterol}-319)_{+}^{3} \\
& + [\mathrm{female}]\,[-0.109\,\mathrm{age} + 7.52\times10^{-5}(\mathrm{age}-36)_{+}^{3} + 0.00015(\mathrm{age}-48)_{+}^{3} - 0.00045(\mathrm{age}-56)_{+}^{3} + 0.000225(\mathrm{age}-68)_{+}^{3}] \\
& + \mathrm{age}\,[-0.00028\,\mathrm{cholesterol} + 2.68\times10^{-9}(\mathrm{cholesterol}-160)_{+}^{3} + 3.03\times10^{-8}(\mathrm{cholesterol}-208)_{+}^{3} \\
& \qquad - 4.99\times10^{-8}(\mathrm{cholesterol}-243)_{+}^{3} + 1.69\times10^{-8}(\mathrm{cholesterol}-319)_{+}^{3}] \\
& + \mathrm{age}'\,[0.00341\,\mathrm{cholesterol} - 4.02\times10^{-7}(\mathrm{cholesterol}-160)_{+}^{3} + 9.71\times10^{-7}(\mathrm{cholesterol}-208)_{+}^{3} \\
& \qquad - 5.79\times10^{-7}(\mathrm{cholesterol}-243)_{+}^{3} + 8.79\times10^{-9}(\mathrm{cholesterol}-319)_{+}^{3}] \\
& + \mathrm{age}''\,[-0.029\,\mathrm{cholesterol} + 3.04\times10^{-6}(\mathrm{cholesterol}-160)_{+}^{3} - 7.34\times10^{-6}(\mathrm{cholesterol}-208)_{+}^{3} \\
& \qquad + 4.36\times10^{-6}(\mathrm{cholesterol}-243)_{+}^{3} - 5.82\times10^{-8}(\mathrm{cholesterol}-319)_{+}^{3}].
\end{aligned}
\]

Fig. 10.11 Linear spline surface for males, with knots for age at 46, 52, 59 and knots
for cholesterol at 196, 224, and 259 (quartiles).

latex(anova(f), caption='Cubic spline surface', file='',
      size='smaller', label='tab:anova-rcs')   # Table 10.7

# Figure 10.12:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)


Table 10.7 Cubic spline surface

                                                      χ²      d.f.  P
age (Factor+Higher Order Factors)                     165.23  15    < 0.0001
  All Interactions                                     37.32  12    0.0002
  Nonlinear (Factor+Higher Order Factors)              21.01  10    0.0210
sex (Factor+Higher Order Factors)                     343.67   4    < 0.0001
  All Interactions                                     23.31   3    < 0.0001
cholesterol (Factor+Higher Order Factors)              97.50  12    < 0.0001
  All Interactions                                     12.95   9    0.1649
  Nonlinear (Factor+Higher Order Factors)              13.62   8    0.0923
age × sex (Factor+Higher Order Factors)                23.31   3    < 0.0001
  Nonlinear                                            13.37   2    0.0013
  Nonlinear Interaction : f(A,B) vs. AB                13.37   2    0.0013
age × cholesterol (Factor+Higher Order Factors)        12.95   9    0.1649
  Nonlinear                                             7.27   8    0.5078
  Nonlinear Interaction : f(A,B) vs. AB                 7.27   8    0.5078
  f(A,B) vs. Af(B) + Bg(A)                              5.41   4    0.2480
  Nonlinear Interaction in age vs. Af(B)                6.44   6    0.3753
  Nonlinear Interaction in cholesterol vs. Bg(A)        6.27   6    0.3931
TOTAL NONLINEAR                                        29.22  14    0.0097
TOTAL INTERACTION                                      37.32  12    0.0002
TOTAL NONLINEAR + INTERACTION                          45.41  16    0.0001
TOTAL                                                 450.88  19    < 0.0001

Statistics  for  testing  age  x  cholesterol  components  of  this  fit  are  above. 
None  of  the  nonlinear  interaction  components  is  significant,  but  we  again 
retain  them. 

The general interaction model can be restricted to be of the form

\[
f(X_1, X_2) = f_1(X_1) + f_2(X_2) + X_1 g_2(X_2) + X_2 g_1(X_1) \tag{10.32}
\]

by removing the parameters β11, β12, β14, and β15 from the model. The previous
table of Wald statistics included a test of adequacy of this reduced form
(χ² = 5.41 on 4 d.f., P = .248). The resulting fit is in Figure 10.13.

f <- lrm(sigdz ~ sex*rcs(age,4) + rcs(cholesterol,4) +
         rcs(age,4) %ia% rcs(cholesterol,4), data=acath)
latex(anova(f), file='', size='smaller',
      caption='Singly nonlinear cubic spline surface',
      label='tab:anova-ria')   # Table 10.8
Fig. 10.12 Restricted cubic spline surface in two variables, each with k = 4 knots


Table 10.8 Singly nonlinear cubic spline surface

                                                      χ²      d.f.  P
sex (Factor+Higher Order Factors)                     343.42   4    < 0.0001
  All Interactions                                     24.05   3    < 0.0001
age (Factor+Higher Order Factors)                     169.35  11    < 0.0001
  All Interactions                                     34.80   8    < 0.0001
  Nonlinear (Factor+Higher Order Factors)              16.55   6    0.0111
cholesterol (Factor+Higher Order Factors)              93.62   8    < 0.0001
  All Interactions                                     10.83   5    0.0548
  Nonlinear (Factor+Higher Order Factors)              10.87   4    0.0281
age × cholesterol (Factor+Higher Order Factors)        10.83   5    0.0548
  Nonlinear                                             3.12   4    0.5372
  Nonlinear Interaction : f(A,B) vs. AB                 3.12   4    0.5372
  Nonlinear Interaction in age vs. Af(B)                1.60   2    0.4496
  Nonlinear Interaction in cholesterol vs. Bg(A)        1.64   2    0.4400
sex × age (Factor+Higher Order Factors)                24.05   3    < 0.0001
  Nonlinear                                            13.58   2    0.0011
  Nonlinear Interaction : f(A,B) vs. AB                13.58   2    0.0011
TOTAL NONLINEAR                                        27.89  10    0.0019
TOTAL INTERACTION                                      34.80   8    < 0.0001
TOTAL NONLINEAR + INTERACTION                          45.45  12    < 0.0001
TOTAL                                                 453.10  15    < 0.0001

# Figure 10.13:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
ltx(f)
Table 10.9 Linear interaction surface

                                                      χ²      d.f.  P
age (Factor+Higher Order Factors)                     167.83   7    < 0.0001
  All Interactions                                     31.03   4    < 0.0001
  Nonlinear (Factor+Higher Order Factors)              14.58   4    0.0057
sex (Factor+Higher Order Factors)                     345.88   4    < 0.0001
  All Interactions                                     22.30   3    0.0001
cholesterol (Factor+Higher Order Factors)              89.37   4    < 0.0001
  All Interactions                                      7.99   1    0.0047
  Nonlinear                                            10.65   2    0.0049
age × cholesterol (Factor+Higher Order Factors)         7.99   1    0.0047
age × sex (Factor+Higher Order Factors)                22.30   3    0.0001
  Nonlinear                                            12.06   2    0.0024
  Nonlinear Interaction : f(A,B) vs. AB                12.06   2    0.0024
TOTAL NONLINEAR                                        25.72   6    0.0003
TOTAL INTERACTION                                      31.03   4    < 0.0001
TOTAL NONLINEAR + INTERACTION                          43.59   8    < 0.0001
TOTAL                                                 452.75  11    < 0.0001

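\[
\begin{aligned}
X\hat{\beta} ={} & -7.2 + 2.96[\mathrm{female}] + 0.164\,\mathrm{age} + 7.23\times10^{-5}(\mathrm{age}-36)_{+}^{3} - 0.000106(\mathrm{age}-48)_{+}^{3} \\
& - 1.63\times10^{-5}(\mathrm{age}-56)_{+}^{3} + 4.99\times10^{-5}(\mathrm{age}-68)_{+}^{3} + 0.0148\,\mathrm{cholesterol} \\
& + 1.21\times10^{-6}(\mathrm{cholesterol}-160)_{+}^{3} - 5.5\times10^{-6}(\mathrm{cholesterol}-208)_{+}^{3} + 5.5\times10^{-6}(\mathrm{cholesterol}-243)_{+}^{3} - 1.21\times10^{-6}(\mathrm{cholesterol}-319)_{+}^{3} \\
& + \mathrm{age}\,[-0.00029\,\mathrm{cholesterol} + 9.28\times10^{-9}(\mathrm{cholesterol}-160)_{+}^{3} + 1.7\times10^{-8}(\mathrm{cholesterol}-208)_{+}^{3} \\
& \qquad - 4.43\times10^{-8}(\mathrm{cholesterol}-243)_{+}^{3} + 1.79\times10^{-8}(\mathrm{cholesterol}-319)_{+}^{3}] \\
& + \mathrm{cholesterol}\,[2.3\times10^{-7}(\mathrm{age}-36)_{+}^{3} + 4.21\times10^{-7}(\mathrm{age}-48)_{+}^{3} - 1.31\times10^{-6}(\mathrm{age}-56)_{+}^{3} + 6.64\times10^{-7}(\mathrm{age}-68)_{+}^{3}] \\
& + [\mathrm{female}]\,[-0.111\,\mathrm{age} + 8.03\times10^{-5}(\mathrm{age}-36)_{+}^{3} + 0.000135(\mathrm{age}-48)_{+}^{3} - 0.00044(\mathrm{age}-56)_{+}^{3} + 0.000224(\mathrm{age}-68)_{+}^{3}].
\end{aligned}
\]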

The fit is similar to the former one except that the climb in risk for low-
cholesterol older subjects is less pronounced. The test for nonlinear interaction
is now more concentrated (P = .54 with 4 d.f.). Figure 10.14 accordingly
depicts a fit that allows age and cholesterol to have nonlinear main effects,
but restricts the interaction to be a product between (untransformed) age
and cholesterol. The function agrees substantially with the previous fit.


f <- lrm(sigdz ~ rcs(age,4)*sex + rcs(cholesterol,4) +
         age %ia% cholesterol, data=acath)
latex(anova(f), caption='Linear interaction surface', file='',
      size='smaller', label='tab:anova-lia')   # Table 10.9

# Figure 10.14:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
f.linia <- f   # save linear interaction fit for later
ltx(f)
Fig. 10.13 Restricted cubic spline fit with age × spline(cholesterol) and cholesterol
× spline(age)


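\[
\begin{aligned}
X\hat{\beta} ={} & -7.36 + 0.182\,\mathrm{age} - 5.18\times10^{-5}(\mathrm{age}-36)_{+}^{3} + 8.45\times10^{-5}(\mathrm{age}-48)_{+}^{3} - 2.91\times10^{-6}(\mathrm{age}-56)_{+}^{3} \\
& - 2.99\times10^{-5}(\mathrm{age}-68)_{+}^{3} + 2.8[\mathrm{female}] + 0.0139\,\mathrm{cholesterol} \\
& + 1.76\times10^{-6}(\mathrm{cholesterol}-160)_{+}^{3} - 4.88\times10^{-6}(\mathrm{cholesterol}-208)_{+}^{3} + 3.45\times10^{-6}(\mathrm{cholesterol}-243)_{+}^{3} - 3.26\times10^{-7}(\mathrm{cholesterol}-319)_{+}^{3} \\
& - 0.00034\,\mathrm{age}\times\mathrm{cholesterol} \\
& + [\mathrm{female}]\,[-0.107\,\mathrm{age} + 7.71\times10^{-5}(\mathrm{age}-36)_{+}^{3} + 0.000115(\mathrm{age}-48)_{+}^{3} - 0.000398(\mathrm{age}-56)_{+}^{3} + 0.000205(\mathrm{age}-68)_{+}^{3}].
\end{aligned}
\]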

The Wald test for age × cholesterol interaction yields χ² = 7.99 with 1
d.f., P = .005. These analyses favor the nonlinear model with simple product
interaction in Figure 10.14 as best representing the relationships among
cholesterol, age, and probability of prognostically severe coronary artery
disease. A nomogram depicting this model is shown in Figure 10.21.

Using this simple product interaction model, Figure 10.15 displays predicted
cholesterol effects at the mean age within each age tertile. Substantial
agreement with Figure 10.9 is apparent.

# Make estimates of cholesterol effects for mean age in
# tertiles corresponding to initial analysis
mean.age <-
  with(acath,
       as.vector(tapply(age, age.tertile, mean, na.rm=TRUE)))
plot(Predict(f, cholesterol, age=round(mean.age,2),
             sex="male"),
     adj.subtitle=FALSE, ylim=yl)   # 3 curves, Figure 10.15
Fig. 10.14 Spline fit with nonlinear effects of cholesterol and age and a simple
product interaction


Fig. 10.15 Predictions from linear interaction model with mean age in tertiles
indicated (x-axis: cholesterol, mg %).


The partial residuals discussed in Section 10.4 can be used to check logistic
model fit (although it may be difficult to deal with interactions). As
an example, reconsider the “duration of symptoms” fit in Figure 10.7.
Figure 10.16 displays “loess smoothed” and raw partial residuals for the original
and log-transformed variable. The latter provides a more linear relationship,
especially where the data are most dense.
Table 10.10 Merits of Methods for Checking Logistic Model Assumptions

Method               Choice     Assumes     Uses Ordering  Low                 Good
                     Required   Additivity  of X           Variance            Resolution
                                                                               on X
Stratification       Intervals
Smoother on X1       Bandwidth              x              x                   x
  stratifying on X2                         (not on X2)    (if min. strat.)    (X1)
Smooth partial       Bandwidth  x           x              x                   x
  residual plot
Spline model         Knots      x           x              x                   x
  for all Xs

f <- lrm(tvdlm ~ cad.dur, data=dz, x=TRUE, y=TRUE)
resid(f, "partial", pl="loess", xlim=c(0,250), ylim=c(-3,3))
scat1d(dz$cad.dur)

log.cad.dur <- log10(dz$cad.dur + 1)

f <- lrm(tvdlm ~ log.cad.dur, data=dz, x=TRUE, y=TRUE)
resid(f, "partial", pl="loess", ylim=c(-3,3))
scat1d(log.cad.dur)   # Figure 10.16


Fig. 10.16 Partial residuals for duration and log10(duration+1). Data density shown
at top of each plot.


Table  10.10  summarizes  the  relative  merits  of  stratification,  nonparametric 
smoothers,  and  regression  splines  for  determining  or  checking  binary  logistic 
model  fits. 


10.6 Collinearity

The variance inflation factors (VIFs) discussed in Section 4.6 can apply to
any regression fit [147, 654]. These VIFs allow the analyst to isolate which
variable(s) are responsible for highly correlated parameter estimates. Recall that,
in general, collinearity is not a large problem compared with nonlinearity and
overfitting.


10.7  Overly  Influential  Observations 

Pregibon  11  developed  a  number  of  regression  diagnostics  that  apply  to  the 
family  of  regression  models  of  which  logistic  regression  is  a  member.  Influence 
statistics  based  on  the  “leave-out-one”  method  use  an  approximation  to  avoid 
having  to  refit  the  model  n  times  for  n  observations.  This  approximation 
uses  the  fit  and  covariance  matrix  at  the  last  iteration  and  assumes  that 
the  “weights”  in  the  weighted  least  squares  fit  can  be  kept  constant,  yielding 
a  computationally  feasible  one-step  estimate  of  the  leave-out-one  regression 
coefficients. 

Hosmer  and  Lemeshow  [305,  pp.  149-170]  discuss  many  diagnostics  for 
logistic  regression  and  show  how  the  final  fit  can  be  used  in  any  least  squares 
program  that  provides  diagnostics.  A  new  dependent  variable  to  be  used  in 
that  way  is 

\[
Z_i = X\hat{\beta} + \frac{Y_i - \hat{P}_i}{V_i}, \tag{10.33}
\]

where V_i = P̂_i(1 − P̂_i), and P̂_i = [1 + exp(−Xβ̂)]⁻¹ is the predicted probability
that Y_i = 1. The V_i, i = 1, 2, ..., n are used as weights in an ordinary weighted
least squares fit of X against Z. This least squares fit will provide regression
coefficients identical to b. The new standard errors will be off from the actual
logistic model ones by a constant.
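The equivalence described for Equation 10.33 can be checked numerically. The sketch below uses simulated data and base R's `glm` and `lm` rather than the rms functions used elsewhere in this chapter; the variable names and the simulation settings are illustrative assumptions only.

```r
set.seed(1)
n   <- 200
age <- rnorm(n, 50, 10)
sex <- rbinom(n, 1, 0.5)
y   <- rbinom(n, 1, plogis(-5 + 0.1 * age + 0.5 * sex))

fit  <- glm(y ~ age + sex, family = binomial)
lp   <- predict(fit, type = "link")   # X beta-hat
phat <- plogis(lp)                    # predicted probabilities
v    <- phat * (1 - phat)             # weights V_i
z    <- lp + (y - phat) / v           # new dependent variable Z_i

# Weighted least squares fit of Z on the predictors
wls <- lm(z ~ age + sex, weights = v)

max_coef_diff <- max(abs(coef(wls) - coef(fit)))
se_ratio <- sqrt(diag(vcov(wls))) / sqrt(diag(vcov(fit)))

max_coef_diff   # essentially zero
se_ratio        # the same constant for every coefficient
```

In this sketch the constant ratio of standard errors equals the residual standard error of the weighted fit (`summary(wls)$sigma`). Once `wls` is available, ordinary least squares diagnostics such as `dfbetas(wls)` can be applied to it.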

As discussed in Section 4.9, the standardized change in the regression
coefficients upon leaving out each observation in turn (DFBETAS) is one of the
most useful diagnostics, as these can pinpoint which observations are
influential on each part of the model. After carefully modeling predictor
transformations, there should be no lack of fit due to improper transformations.
However, as the white blood count example in Section 4.9 indicates, it is
commonly the case that extreme predictor values can still have too much
influence on the estimates of coefficients involving that predictor.

In the age-sex-response example of Section 10.1.3, both DFBETAS and
DFFITS identified the same influential observations. The observation given
by age = 48, sex = female, response = 1 was influential for both age and sex,
while the observation age = 34, sex = male, response = 1 was influential for
age and the observation age = 50, sex = male, response = 0 was influential
for sex. It can readily be seen from Figure 10.3 that these points do not fit
the overall trends in the data. However, as these data were simulated from a
population model that is truly linear in age and additive in age and sex, the
apparent influential observations are just random occurrences. It is unwise
to assume that in real data all points will agree with overall trends. Removal
of such points would bias the results, making the model apparently more
predictive than it will be prospectively. See Table 10.11.

Table 10.11 Example influence statistics

              Females                              Males
        DFBETAS            DFFITS          DFBETAS            DFFITS
Intercept   Age    Sex             Intercept   Age    Sex
   0.0      0.0    0.0       0        0.5     -0.5   -0.2       2
   0.0      0.0    0.0       0        0.2     -0.3    0.0       1
   0.0      0.0    0.0       0       -0.1      0.1    0.0      -1
   0.0      0.0    0.0       0       -0.1      0.1    0.0      -1
  -0.1      0.1    0.1       0       -0.1      0.1   -0.1      -1
  -0.1      0.1    0.1       0        0.0      0.0    0.1       0
   0.7     -0.7   -0.8       3        0.0      0.0    0.1       0
  -0.1      0.1    0.1       0        0.0      0.0    0.1       0
  -0.1      0.1    0.1       0        0.0      0.0   -0.2      -1
  -0.1      0.1    0.1       0        0.1     -0.1   -0.2      -1
  -0.1      0.1    0.1       0        0.0      0.0    0.1       0
  -0.1      0.0    0.1       0       -0.1      0.1    0.1       0
  -0.1      0.0    0.1       0       -0.1      0.1    0.1       0
   0.1      0.0   -0.2       1        0.3     -0.3   -0.4      -2
   0.0      0.0    0.1      -1       -0.1      0.1    0.1       0
   0.1     -0.2    0.0      -1       -0.1      0.1    0.1       0
  -0.1      0.2    0.0       1       -0.1      0.1    0.1       0
  -0.2      0.2    0.0       1        0.0      0.0    0.0       0
  -0.2      0.2    0.0       1        0.0      0.0    0.0       0
  -0.2      0.2    0.1       1        0.0      0.0    0.0       0

f <- update(fasr, x=TRUE, y=TRUE)
which.influence(f, .4)   # Table 10.11
10.8 Quantifying Predictive Ability

The test statistics discussed above allow one to test whether a factor or set of
factors is related to the response. If the sample is sufficiently large, a factor
that grades risk from .01 to .02 may be a significant risk factor. However, that
factor is not very useful in predicting the response for an individual subject.
There is controversy regarding the appropriateness of R² from ordinary least
squares in this setting [136, 424]. The generalized R²_N index of Nagelkerke [4^1] and
Cragg and Uhler [137], Maddala [431], and Magee [432] described in Section 9.8.3
can be useful for quantifying the predictive strength of a model:

\[
R^2_N = \frac{1 - \exp(-\mathrm{LR}/n)}{1 - \exp(-L^0/n)}, \tag{10.34}
\]

where LR is the global log likelihood ratio statistic for testing the importance
of all p predictors in the model and L⁰ is the −2 log likelihood for the null
model.
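Equation 10.34 can be evaluated directly from two fitted models. The following sketch uses simulated data and base R's `glm` (an assumption for illustration; in rms, `lrm` reports this index itself).

```r
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit  <- glm(y ~ x, family = binomial)   # model with the predictor
null <- glm(y ~ 1, family = binomial)   # intercept-only (null) model

L0 <- as.numeric(-2 * logLik(null))        # -2 log likelihood, null model
LR <- L0 - as.numeric(-2 * logLik(fit))    # global likelihood ratio statistic

r2_n <- (1 - exp(-LR / n)) / (1 - exp(-L0 / n))
r2_n
```

The denominator is the largest value the numerator can attain for these data, so the index is scaled to lie between 0 and 1.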

Tjur [613] coined the term “coefficient of discrimination” D, defined as the
average P̂ when Y = 1 minus the average P̂ when Y = 0, and showed how it
ties in with sum of squares-based R² measures. D has many advantages as
an index of predictive power^d.

Linnet [416] advocates quadratic and logarithmic probability scoring rules
for measuring predictive performance for probability models. Linnet shows
how to bootstrap such measures to get bias-corrected estimates and how to
use bootstrapping to compare two correlated scores. The quadratic scoring
rule is Brier’s score, frequently used in judging meteorologic forecasts [30, 73]:

\[
B = \frac{1}{n} \sum_{i=1}^{n} (\hat{P}_i - Y_i)^2, \tag{10.35}
\]

where P̂_i is the predicted probability and Y_i the corresponding observed
response for the ith observation.
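Both Tjur's coefficient of discrimination D and the Brier score B are simple functions of the predicted probabilities and the observed responses. A minimal sketch with simulated data follows; the data and variable names are assumptions made for illustration.

```r
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

phat <- fitted(glm(y ~ x, family = binomial))   # predicted probabilities

# Coefficient of discrimination: mean P-hat among Y = 1 minus mean among Y = 0
D <- mean(phat[y == 1]) - mean(phat[y == 0])

# Brier score (Equation 10.35): mean squared difference between P-hat and Y
B <- mean((phat - y)^2)

c(D = D, B = B)
```

Larger D and smaller B indicate better predictions; both depend on the absolute probability estimates, not only on their ranks.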

A  unitless  index  of  the  strength  of  the  rank  correlation  between  predicted 
probability  of  response  and  actual  response  is  a  more  interpretable  measure  of 
the  fitted  model’s  predictive  discrimination.  One  such  index  is  the  probability 
of  concordance,  c,  between  predicted  probability  and  response.  The  c  index, 
which is derived from the Wilcoxon-Mann-Whitney two-sample rank test,
is  computed  by  taking  all  possible  pairs  of  subjects  such  that  one  subject 
responded  and  the  other  did  not.  The  index  is  the  proportion  of  such  pairs 
with  the  responder  having  a  higher  predicted  probability  of  response  than 
the  nonresponder. 

Bamber [39] and Hanley and McNeil have shown that c is identical to a
widely used measure of diagnostic discrimination, the area under a “receiver
operating characteristic” (ROC) curve. A value of c of .5 indicates random
predictions, and a value of 1 indicates perfect prediction (i.e., perfect separation
of responders and nonresponders). A model having c greater than roughly
.8  has  some  utility  in  predicting  the  responses  of  individual  subjects.  The 
concordance  index  is  also  related  to  another  widely  used  index,  Somers’  Dxy 
rank  correlation  9  between  predicted  probabilities  and  observed  responses, 
by  the  identity 

Dxy  =  2(c-  .5).  (10.36) 

Dxy is the difference between concordance and discordance probabilities.
When Dxy = 0, the model is making random predictions. When Dxy = 1,
the predictions are perfectly discriminating. These rank-based indexes have
the advantage of being insensitive to the prevalence of positive responses.

d Note that D and B (below) and other indexes not related to c (below) do not work
well in case-control studies because of their reliance on absolute probability estimates.
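The c index and Somers' Dxy can be computed either by enumerating all responder/nonresponder pairs or from the ranks of the predicted probabilities (the Wilcoxon-Mann-Whitney form). The sketch below does both on a tiny made-up data set. The function names are hypothetical, and counting tied pairs as one half is an assumption of this sketch that the text does not spell out.

```r
# c by brute force over all (responder, nonresponder) pairs;
# tied predictions are counted as 1/2 (an assumption of this sketch)
c_pairs <- function(p, y) {
  d <- outer(p[y == 1], p[y == 0], "-")
  (sum(d > 0) + 0.5 * sum(d == 0)) / length(d)
}

# c from the rank sum of the responders
c_ranks <- function(p, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

p <- c(0.1, 0.3, 0.3, 0.6, 0.8, 0.9)   # predicted probabilities
y <- c(0,   0,   1,   0,   1,   1)     # observed responses

c1  <- c_pairs(p, y)
c2  <- c_ranks(p, y)
dxy <- 2 * (c1 - 0.5)                  # Equation 10.36
c(c_pairs = c1, c_ranks = c2, Dxy = dxy)
```

Of the nine pairs here, seven are concordant and one is tied, so both functions give c = 7.5/9 and Dxy = 2/3.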

A  commonly  used  measure  of  predictive  ability  for  binary  logistic  models  is 
the  fraction  of  correctly  classified  responses.  Here  one  chooses  a  cutoff  on  the 
predicted  probability  of  a  positive  response  and  then  predicts  that  a  response 
will  be  positive  if  the  predicted  probability  exceeds  this  cutoff.  There  are  a 
number  of  reasons  why  this  measure  should  be  avoided. 

1.  It’s  highly  dependent  on  the  cutpoint  chosen  for  a  “positive”  prediction. 

2.  You can add a highly significant variable to the model and have the
percentage classified correctly actually decrease. Classification error is a very
insensitive and statistically inefficient measure [264, 633] since if the threshold
for “positive” is, say 0.75, a prediction of 0.99 rates the same as one of
0.751.

3.  It  gets  away  from  the  purpose  of  fitting  a  logistic  model.  A  logistic  model 
is  a  model  for  the  probability  of  an  event,  not  a  model  for  the  occurrence 
of  the  event.  For  example,  suppose  that  the  event  we  are  predicting  is 
the  probability  of  being  struck  by  lightning.  Without  having  any  data, 
we  would  predict  that  you  won’t  get  struck  by  lightning.  However,  you 
might  develop  an  interesting  model  that  discovers  real  risk  factors  that 
yield  probabilities  of  being  struck  that  range  from  0.000000001  to  0.001. 

4.  If  you  make  a  classification  rule  from  a  probability  model,  you  are  being 
presumptuous.  Suppose  that  a  model  is  developed  to  assist  physicians 
in  diagnosing  a  disease.  Physicians  sometimes  profess  to  desiring  a  binary 
decision  model,  but  if  given  a  probability  they  will  rightfully  apply  different 
thresholds  for  treating  different  patients  or  for  ordering  other  diagnostic 
tests.  Even  though  the  age  of  the  patient  may  be  a  strong  predictor  of 
the  probability  of  disease,  the  physician  will  often  use  a  lower  threshold 
of  disease  likelihood  for  treating  a  young  patient.  This  usage  is  above  and 
beyond  how  age  affects  the  likelihood. 

5.  If  a  disease  were  present  in  only  0.02  of  the  population,  one  could  be  0.98 
accurate  in  diagnosing  the  disease  by  ruling  that  everyone  is  disease-free, 
i.e.,  by  avoiding  predictors.  The  proportion  classified  correctly  fails  to  take 
the  difficulty  of  the  task  into  account. 

6.  van Houwelingen and le Cessie [633] demonstrated a peculiar property that
occurs when you try to obtain an honest estimate of classification error
using cross-validation. The cross-validated error rate corrects the apparent
error rate only if the predicted probability is exactly 1/2 or is 1/2 ± 1/(2n).
The cross-validation estimate of optimism is “zero for n even and negligibly
small for n odd.” Better measures of error rate such as the Brier score and
logarithmic scoring rule do not have this problem. They also have the
nice property of being maximized when the predicted probabilities are the
population probabilities [416].
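The insensitivity of the proportion classified correctly can be seen in a small constructed example in the spirit of items 2 and 5. All numbers below are hypothetical: a population of 1000 with prevalence 0.02, one "model" that ignores predictors, and a second that isolates a high-risk group containing every case.

```r
# Hypothetical population: 20 cases followed by 980 non-cases
y <- c(rep(1, 20), rep(0, 980))

# Model A ignores predictors: everyone is assigned the prevalence
p_const <- rep(0.02, 1000)

# Model B finds a high-risk group of 50 (all 20 cases plus 30 non-cases,
# predicted risk 20/50 = 0.4) and assigns risk 0 to everyone else
p_risk <- c(rep(0.4, 20), rep(0.4, 30), rep(0, 950))

accuracy <- function(p, y, cutoff = 0.5) mean((p > cutoff) == (y == 1))
brier    <- function(p, y) mean((p - y)^2)

acc_const <- accuracy(p_const, y)
acc_risk  <- accuracy(p_risk, y)
b_const   <- brier(p_const, y)
b_risk    <- brier(p_risk, y)

rbind(accuracy = c(A = acc_const, B = acc_risk),
      brier    = c(A = b_const,   B = b_risk))
```

With a cutoff of 0.5 both models classify 98% of subjects correctly, so the classification rate cannot tell them apart, whereas the Brier score falls from 0.0196 to 0.012 for the model that actually separates risk.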
10.9 Validating the Fitted Model

The major cause of unreliable models is overfitting the data. The methods
described in Section 5.3 can be used to assess the accuracy of models fairly.
If a sample has been held out and never used to study associations with the
response, indexes of predictive accuracy can now be estimated using that
sample. More efficient is cross-validation, and bootstrapping is the most
efficient validation procedure. As discussed earlier, bootstrapping does not
require holding out any data, since all aspects of model development (stepwise
variable selection, tests of linearity, estimation of coefficients, etc.) are
revalidated on samples taken with replacement from the whole sample.

Cox [130] proposed and Harrell and Lee [267] and Miller et al. further
developed the idea of fitting a new binary logistic model to a new sample to
estimate the relationship between the predicted probability and the observed
outcome in that sample. This fit provides a simple calibration equation that
can be used to quantify unreliability (lack of calibration) and to calibrate
the predictions for future use. This logistic calibration also leads to indexes
of unreliability (U), discrimination (D), and overall quality (Q = D − U)
which are derived from likelihood ratio tests [267]. Q is a logarithmic scoring
rule, which can be compared with Brier’s index (Equation 10.35). See [633]
for many more ideas.

With bootstrapping we do not have a separate validation sample for
assessing calibration, but we can estimate the overoptimism in assuming that
the final model needs no calibration, that is, it has overall intercept = 0 and
slope = 1. As discussed in Section 5.3, refitting the model

\[
P_c = \mathrm{Prob}\{Y = 1 \mid X\hat{\beta}\} = [1 + \exp -(\gamma_0 + \gamma_1 X\hat{\beta})]^{-1} \tag{10.37}
\]

(where P_c denotes the calibrated probability and the original predicted
probability is P̂ = [1 + exp(−Xβ̂)]⁻¹) in the original sample will always result in
γ = (γ0, γ1) = (0, 1), since a logistic model will always “fit” the training
sample when assessed overall. We thus estimate γ by using Efron’s [172] method to
estimate the overoptimism in (0, 1) to obtain bias-corrected estimates of the
true calibration. Simulations have shown this method produces an efficient
estimate of γ [259].
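The two claims in this paragraph can be illustrated with a short simulation: refitting Equation 10.37 on the training sample returns (0, 1), and an optimism bootstrap gives a bias-corrected intercept and slope. This is only a sketch using base R on simulated data, with 100 resamples chosen arbitrarily; it is not the rms `validate` implementation, and all names are illustrative.

```r
set.seed(4)
n <- 200
d <- data.frame(age = rnorm(n, 50, 10), sex = rbinom(n, 1, 0.5))
d$y <- rbinom(n, 1, plogis(-5 + 0.1 * d$age + 0.5 * d$sex))

fit <- glm(y ~ age + sex, family = binomial, data = d)
lp  <- predict(fit, type = "link")

# Calibration model refitted on the training sample: (gamma0, gamma1) = (0, 1)
gamma_train <- coef(glm(d$y ~ lp, family = binomial))

# Optimism bootstrap: fit on a resample, calibrate on resample and original
n_boot <- 100
opt <- replicate(n_boot, {
  b    <- d[sample(n, replace = TRUE), ]
  fb   <- glm(y ~ age + sex, family = binomial, data = b)
  lp_b <- predict(fb, type = "link")               # resample (training)
  lp_o <- predict(fb, newdata = d, type = "link")  # original sample (test)
  g_b  <- coef(glm(b$y ~ lp_b, family = binomial))
  g_o  <- coef(glm(d$y ~ lp_o, family = binomial))
  g_b - g_o                                        # optimism in (gamma0, gamma1)
})

gamma_corrected <- c(0, 1) - rowMeans(opt)
gamma_train
gamma_corrected
```

A corrected slope below 1 suggests that predictions are too extreme and would benefit from shrinkage.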

More stringent calibration checks can be made by running separate
calibrations for different covariate levels. Smooth nonparametric curves described in
Section 10.11 are more flexible than the linear-logit calibration method just
described.

A good set of indexes to estimate for summarizing a model validation is the
c or Dxy indexes and measures of calibration. In addition, the overoptimism
in the indexes may be reported to quantify the amount of overfitting present.
The estimate of γ can be used to draw a calibration curve by plotting P̂
on the x-axis and P̂_c = [1 + exp −(γ0 + γ1 L)]⁻¹ on the y-axis, where L =
logit(P̂) [130, 267]. An easily interpreted index of unreliability, E_max, follows
immediately from this calibration model:
\[
E_{\max}(a, b) = \max_{a \le \hat{P} \le b} |\hat{P} - \hat{P}_c|, \tag{10.38}
\]

the maximum error in predicted probabilities over the range a ≤ P̂ ≤ b. In
some cases, we would compute the maximum absolute difference in predicted
and calibrated probabilities over the entire interval, that is, use E_max(0, 1).
The null hypothesis H0 : E_max(0, 1) = 0 can easily be tested by testing
H0 : γ0 = 0, γ1 = 1 as above. Since E_max does not weight the discrepancies
by the actual distribution of predictions, it may be preferable to compute the
average absolute discrepancy over the actual distribution of predictions (or
to use a mean squared error, incorporating the same calibration function).
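Given an intercept and slope for the calibration model, the maximum calibration error can be approximated on a grid of predicted probabilities. The sketch below assumes the linear-logit calibration form described above; the function name `emax`, the grid size, and the choice to stop the grid just short of 0 and 1 are assumptions of this sketch.

```r
# Approximate E_max(a, b): the largest |P-hat - P_c| over a grid of P-hat,
# where P_c = plogis(gamma0 + gamma1 * logit(P-hat))
emax <- function(gamma0, gamma1, a = 0.001, b = 0.999, n_grid = 999) {
  p  <- seq(a, b, length.out = n_grid)
  pc <- plogis(gamma0 + gamma1 * qlogis(p))
  max(abs(p - pc))
}

e_perfect <- emax(0, 1)        # perfectly calibrated: no error
e_shrunk  <- emax(0.01, 0.91)  # intercept 0.01, slope 0.91
c(perfect = e_perfect, shrunk = e_shrunk)
```

An intercept of 0.01 and slope of 0.91 give a maximum error of roughly 0.02 in this sketch.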

If  stepwise  variable  selection  is  being  done,  a  matrix  depicting  which  factors 
are  selected  at  each  bootstrap  sample  will  shed  light  on  how  arbitrary  is  the 
selection  of  “significant”  factors.  See  Section  5.3  for  reasons  to  compare  full 
and  stepwise  model  fits. 

As an example using bootstrapping to validate the calibration and discrimination of a model, consider the data in Section 10.1.3. Using 150 samples with replacement, we first validate the additive model with age and sex forced into every model. The optimism-corrected discrimination and calibration statistics produced by validate (see Section 10.11) are in the table below.

      caption='Bootstrap Validation, 2 Predictors Without Stepdown',
      digits=2, size='Ssize', file='')


Bootstrap Validation, 2 Predictors Without Stepdown

Index      Original  Training   Test   Optimism  Corrected    n
            Sample    Sample   Sample              Index
Dxy           0.70      0.70     0.67     0.04      0.66     150
R2            0.45      0.48     0.43     0.05      0.40     150
Intercept     0.00      0.00     0.01    -0.01      0.01     150
Slope         1.00      1.00     0.91     0.09      0.91     150
Emax          0.00      0.00     0.02     0.02      0.02     150
D             0.39      0.44     0.36     0.07      0.32     150
U            -0.05     -0.05     0.04    -0.09      0.04     150
Q             0.44      0.49     0.32     0.16      0.28     150
B             0.16      0.15     0.18    -0.03      0.19     150
g             2.10      2.49     1.97     0.52      1.58     150
gp            0.35      0.35     0.34     0.01      0.34     150

Now we incorporate variable selection. The variables selected in the first 10 bootstrap replications are shown below. The apparent Somers' $D_{xy}$ is 0.7, and the bias-corrected $D_{xy}$ is 0.66. The slope shrinkage factor is 0.91. The maximum absolute error in predicted probability is estimated to be 0.02.

We next allow for step-down variable selection at each resample. For illustration purposes only, we use a suboptimal stopping rule based on significance of individual variables at the $\alpha = 0.10$ level. Of the 150 repetitions, both age and sex were selected in 137, and neither variable was selected in 3 samples. The validation statistics are in the table below.

v2 <- validate(f, B=150, bw=TRUE,
               rule='p', sls=.1, type='individual')

latex(v2,
      caption='Bootstrap Validation, 2 Predictors with Stepdown',
      digits=2, B=15, file='', size='Ssize')


Bootstrap Validation, 2 Predictors with Stepdown

Index      Original  Training   Test   Optimism  Corrected    n
            Sample    Sample   Sample              Index
Dxy           0.70      0.70     0.64     0.07      0.63     150
R2            0.45      0.49     0.41     0.09      0.37     150
Intercept     0.00      0.00    -0.04     0.04     -0.04     150
Slope         1.00      1.00     0.84     0.16      0.84     150
Emax          0.00      0.00     0.05     0.05      0.05     150
D             0.39      0.45     0.34     0.11      0.28     150
U            -0.05     -0.05     0.06    -0.11      0.06     150
Q             0.44      0.50     0.28     0.22      0.22     150
B             0.16      0.14     0.18    -0.04      0.20     150
g             2.10      2.60     1.88     0.72      1.38     150
gp            0.35      0.35     0.33     0.02      0.33     150

Factors Retained in Backwards Elimination
First 15 Resamples

sex age

Frequencies of Numbers of Factors Retained

  0   1   2
  3  10 137


The apparent Somers' $D_{xy}$ is 0.7 for the original stepwise model (which actually retained both age and sex), and the bias-corrected $D_{xy}$ is 0.63, slightly worse than the more correct model which forced in both variables. The calibration was also slightly worse as reflected in the slope correction factor estimate of 0.84 versus 0.91.

Next, five additional candidate variables are considered. These variables are random uniform variables, x1, ..., x5 on the [0, 1] interval, and have no association with the response.




            function(x) paste(v[1:2][x], collapse=', '))
table(paste(as, ' ', nx, 'Xs'))

latex(v3,
      caption='Bootstrap Validation with 5 Noise Variables and Stepdown',
      digits=2, B=15, size='Ssize', file='')


Bootstrap Validation with 5 Noise Variables and Stepdown

Index      Original  Training   Test   Optimism  Corrected    n
            Sample    Sample   Sample              Index
Dxy           0.70      0.47     0.38     0.09      0.60     139
R2            0.45      0.34     0.23     0.11      0.34     139
Intercept     0.00      0.00     0.03    -0.03      0.03     139
Slope         1.00      1.00     0.78     0.22      0.78     139
Emax          0.00      0.00     0.06     0.06      0.06     139
D             0.39      0.31     0.18     0.13      0.26     139
U            -0.05     -0.05     0.07    -0.12      0.07     139
Q             0.44      0.36     0.11     0.25      0.19     139
B             0.16      0.17     0.22    -0.04      0.20     139
g             2.10      1.81     1.06     0.75      1.36     139
gp            0.35      0.23     0.19     0.04      0.31     139

Factors Retained in Backwards Elimination
First 15 Resamples

age sex x1 x2 x3 x4 x5

Frequencies of Numbers of Factors Retained

  0   1   2   3   4   5   6
 50  15  37  18  11   7   1

Using step-down variable selection with the same stopping rule as before, the “final” model on the original sample correctly deleted x1, ..., x5. Of the 150 bootstrap repetitions, 11 samples yielded a singularity or non-convergence either in the full-model fit or after step-down variable selection. Of the 139 successful repetitions, the frequencies of the number of factors selected, as well as the frequency of variable combinations selected, are shown above. Validation statistics are also shown above.

Figure 10.17 depicts the calibration (reliability) curves for the three strategies using the corrected intercept and slope estimates in the above tables as $\gamma_0$ and $\gamma_1$, and the logistic calibration model $\hat{P}_c = [1 + \exp[-(\gamma_0 + \gamma_1 L)]]^{-1}$, where $\hat{P}_c$ is the “actual” or calibrated probability, $L$ is $\text{logit}(\hat{P})$, and $\hat{P}$ is the predicted probability. The shape of the calibration curves (driven by slopes < 1) is typical of overfitting: low predicted probabilities are too low and high predicted probabilities are too high. Predictions near the overall prevalence of the outcome tend to be calibrated even when overfitting is present.

“Honest” calibration curves may also be estimated using nonparametric smoothers in conjunction with bootstrapping and cross-validation (see Section 10.11).
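To make the overfitting pattern concrete, the following base-R sketch evaluates the logistic calibration model at a few predicted probabilities, using the corrected intercepts and slopes reported in the three validation tables. The strategy labels and the object names (`strategies`, `P`, `Pc`) are illustrative choices, not part of the original analysis.

```r
# Corrected intercept (gamma0) and slope (gamma1) from the three validation tables
strategies <- data.frame(
  strategy = c("age, sex forced in",
               "age, sex with step-down",
               "age, sex, x1-x5 with step-down"),
  gamma0   = c(0.01, -0.04, 0.03),
  gamma1   = c(0.91,  0.84, 0.78)
)

P <- c(0.05, 0.25, 0.50, 0.75, 0.95)   # predicted probabilities to evaluate

# Calibrated probability Pc = plogis(gamma0 + gamma1 * logit(P)), one row per strategy
Pc <- t(sapply(seq_len(nrow(strategies)), function(i)
  plogis(strategies$gamma0[i] + strategies$gamma1[i] * qlogis(P))))
dimnames(Pc) <- list(strategies$strategy, paste0("P=", P))
round(Pc, 3)
```

In each row the calibrated value lies above a predicted 0.05 and below a predicted 0.95, which is the "low predictions too low, high predictions too high" shape described above, and the departure grows as the slope falls from 0.91 to 0.78.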


10.10  Describing  the  Fitted  Model 

Once the proper variables have been modeled and all model assumptions have been met, the analyst needs to present and interpret the fitted model. There are at least three ways to proceed. The coefficients in the model may be interpreted. For each variable, the change in log odds for a sensible change in the variable value (e.g., interquartile range) may be computed. Also, the odds ratio or factor by which the odds increase for a certain change in a predictor, holding all other predictors constant, may be displayed. Table 10.12 contains such summary statistics for the linear age × cholesterol interaction surface fit described in Section 10.5.

Fig. 10.17 Estimated logistic calibration (reliability) curves obtained by bootstrapping three modeling strategies.

Table 10.12 Effects    Response: sigdz

                      Low  High   Δ     Effect     S.E.    Lower 0.95  Upper 0.95
age                    46    59   13    0.90629  0.18381    0.546030     1.26650
  Odds Ratio           46    59   13    2.47510             1.726400     3.54860
cholesterol           196   259   63    0.75479  0.13642    0.487410     1.02220
  Odds Ratio          196   259   63    2.12720             1.628100     2.77920
sex - female:male       1     2        -2.42970  0.14839   -2.720600    -2.13890
  Odds Ratio            1     2         0.08806             0.065837     0.11778


The outer quartiles of age are 46 and 59 years, so the “half-sample” odds ratio for age is 2.47, with 0.95 confidence interval [1.73, 3.55] when sex is male and cholesterol is set to its median. The effect of increasing cholesterol from 196 (its lower quartile) to 259 (its upper quartile) is to increase the log odds by 0.75 or to increase the odds by a factor of 2.13. Since there are interactions allowed between age and sex and between age and cholesterol, each odds ratio in the above table depends on the setting of at least one other factor. The results are shown graphically in Figure 10.18. The shaded confidence bars show various levels of confidence and do not pin the analyst down to, say, the 0.95 level.

Fig. 10.18 Odds ratios and confidence bars, using quartiles of age and cholesterol for assessing their effects on the odds of coronary disease
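The odds ratio rows of Table 10.12 are the exponentiated effect rows. A short base-R check for the age row, assuming normal-theory (Wald) limits on the log odds scale; the variable names are illustrative:

```r
effect <- 0.90629                 # change in log odds for age 46 -> 59 (Table 10.12)
se     <- 0.18381                 # its standard error

z      <- qnorm(0.975)            # multiplier for 0.95 confidence limits
ci_log <- effect + c(-1, 1) * z * se   # limits on the log odds scale

or     <- exp(effect)             # odds ratio
ci_or  <- exp(ci_log)             # limits on the odds ratio scale

round(c(lower.log = ci_log[1], upper.log = ci_log[2]), 4)
round(c(OR = or, lower = ci_or[1], upper = ci_or[2]), 3)
```

The limits are symmetric about the effect on the log odds scale and become asymmetric about the odds ratio after exponentiation.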

For those used to thinking in terms of odds or log odds, the preceding description may be sufficient. Many prefer to interpret the model in terms of predicted probabilities instead of odds. If the model contains only a single predictor (even if several spline terms are required to represent that predictor), one may simply plot the predictor against the predicted response. Such a plot is shown in Figure 10.19, which depicts the fitted relationship between age of diagnosis and the probability of acute bacterial meningitis (ABM) as opposed to acute viral meningitis (AVM), based on an analysis of 422 cases from Duke University Medical Center [580]. The data may be found on the web site. A linear spline function with knots at 1, 2, and 22 years was used to model this relationship.

When the model contains more than one predictor, one may graph the predictor against log odds, and barring interactions, the shape of this relationship will be independent of the level of the other predictors. When displaying the model on what is usually a more interpretable scale, the probability scale, a difficulty arises in that unlike log odds the relationship between one predictor and the probability of response depends on the levels of all other factors. For example, in the model

$$\text{Prob}\{Y = 1 | X\} = \{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)]\}^{-1} \tag{10.39}$$

there is no way to factor out $X_1$ when examining the relationship between $X_2$ and the probability of a response. For the two-predictor case one can plot $X_2$ versus predicted probability for each level of $X_1$. When it is uncertain whether to include an interaction in this model, consider presenting graphs for two models (with and without interaction terms included) as was done in [658].

Fig. 10.19 Linear spline fit for probability of bacterial versus viral meningitis as a function of age at onset [580]. Points are simple proportions by age quantile groups.
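The difficulty with the probability scale in an additive two-predictor logistic model can be seen numerically. In the base-R sketch below the coefficients `b0`, `b1`, `b2` and the grid of predictor values are hypothetical, chosen only for illustration.

```r
b0 <- -1; b1 <- 2; b2 <- 1                 # hypothetical coefficients, no interaction

# expand.grid varies x1 fastest, so within each x1 the rows are ordered x2 = 0, 1
grid <- expand.grid(x1 = c(-1, 0, 1), x2 = c(0, 1))
grid$logit <- b0 + b1 * grid$x1 + b2 * grid$x2
grid$prob  <- plogis(grid$logit)

# Effect of moving x2 from 0 to 1, separately at each level of x1
d_logit <- with(grid, tapply(logit, x1, diff))   # identical at every x1
d_prob  <- with(grid, tapply(prob,  x1, diff))   # differs across x1
round(rbind(d_logit, d_prob), 3)
```

On the log odds scale the effect of the second predictor is the same at every level of the first, whereas the change in probability varies with the first predictor even though no interaction term is present.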


When three factors are present, one could draw a separate graph for each level of $X_3$, a separate curve on each graph for each level of $X_1$, and vary $X_2$ on the x-axis. Instead of this, or if more than three factors are present, a good way to display the results may be to plot “adjusted probability estimates” as a function of one predictor, adjusting all other factors to constants such as the mean. For example, one could display a graph relating serum cholesterol to probability of myocardial infarction or death, holding age constant at 55, sex at 1 (male), and systolic blood pressure at 120 mmHg.

The final method for displaying the relationship between several predictors and probability of response is to construct a nomogram [40, 254]. A nomogram not only sheds light on how the effect of one predictor on the probability of response depends on the levels of other factors, but it allows one to quickly estimate the probability of response for individual subjects. The nomogram in Figure 10.20 allows one to predict the probability of acute bacterial meningitis (given the patient has either viral or bacterial meningitis) using the same sample as in Figure 10.19. Here there are four continuous predictor values, none of which are linearly related to log odds of bacterial meningitis: age at admission (expressed as a linear spline function), month of admission (expressed as |month − 8|), cerebrospinal fluid glucose/blood glucose ratio (linear effect truncated at .6; that is, the effect is the glucose ratio if it is ≤ .6, and .6 if it exceeded .6), and the cube root of the total number of polymorphonuclear leukocytes in the cerebrospinal fluid.
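The four predictor transformations just listed are simple to compute. The base-R sketch below uses made-up patient values and hypothetical variable names (`age`, `month`, `gl.ratio`, `pmn`); the spline knots at 1, 2, and 22 years are assumed to be the same as in the single-predictor fit of Figure 10.19, which the text does not confirm for the nomogram model.

```r
# Hypothetical patient values; names are illustrative only
pt <- data.frame(age = c(0.5, 10, 40), month = c(2, 8, 12),
                 gl.ratio = c(0.3, 0.6, 0.9), pmn = c(0, 125, 1000))

# Linear spline basis: x, (x - k1)+, (x - k2)+, ... for the given knots
lspline <- function(x, knots)
  cbind(x, sapply(knots, function(k) pmax(x - k, 0)))

X <- with(pt, cbind(
  lspline(age, c(1, 2, 22)),   # age as a linear spline (assumed knots)
  abs(month - 8),              # distance of admission month from August
  pmin(gl.ratio, 0.6),         # glucose ratio truncated at .6
  pmn^(1/3)                    # cube root of PMN count
))
colnames(X) <- c("age", "age1", "age2", "age22",
                 "month.dist", "gl.trunc", "pmn.cuberoot")
X
```

Each column would enter the logistic model linearly, so the fitted log odds remain a simple weighted sum even though none of the raw predictors acts linearly.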

The model associated with Figure 10.14 is depicted in what could be called a “precision nomogram” in Figure 10.21. Discrete cholesterol levels were required because of the interaction between two continuous variables.

Fig. 10.20 Nomogram for estimating probability of bacterial (ABM) versus viral (AVM) meningitis. Step 1, place ruler on reading lines for patient's age and month of presentation and mark intersection with line A; step 2, place ruler on values for glucose ratio and total polymorphonuclear leukocyte (PMN) count in cerebrospinal fluid and mark intersection with line B; step 3, use ruler to join marks on lines A and B, then read off the probability of ABM versus AVM [580].


# Draw a nomogram that shows examples of confidence intervals
nom <- nomogram(f.linia, cholesterol=seq(150, 400, by=50),
                interact=list(age=seq(30, 70, by=10)),
                lp.at=seq(-2, 3.5, by=.5),
                conf.int=TRUE, conf.lp="all",
                fun=function(x) 1/(1+exp(-x)),   # or plogis
                funlabel="Probability of CAD",
                fun.at=c(seq(.1, .9, by=.1), .95, .99))   # Figure 10.21
plot(nom, col.grid=gray(c(0.8, 0.95)),
     varname.label=FALSE, ia.space=1, xfrac=.46, lmgp=.2)

10.11  R  Functions 




The general R statistical modeling functions [96] described in Section 6.2 work with the author's lrm function for fitting binary and ordinal logistic regression models. lrm has several options for doing penalized maximum likelihood estimation, with special treatment of categorical predictors so as to shrink all estimates (including the reference cell) to the mean. The following example fits a logistic model containing predictors age, blood.pressure, and sex, with age fitted with a smooth five-knot restricted cubic spline function and a different shape of the age relationship for males and females.

fit <- lrm(death ~ blood.pressure + sex * rcs(age, 5))
anova(fit)
plot(Predict(fit, age, sex))


The  pentrace  function  makes  it  easy  to  check  the  effects  of  a  sequence  of 
penalties.  The  following  code  fits  an  unpenalized  model  and  plots  the  AIC 
and  Schwarz  BIC  for  a  variety  of  penalties  so  that  approximately  the  best 
cross-validating  model  can  be  chosen  (and  so  we  can  learn  how  the  penalty 
relates  to  the  effective  degrees  of  freedom).  Here  we  elect  to  only  penalize  the 
nonlinear  or  non-additive  parts  of  the  model. 

f <- lrm(death ~ rcs(age, 5)*treatment + lsp(sbp, c(120, 140)),
         x=TRUE, y=TRUE)
plot(pentrace(f,
              penalty=list(nonlinear=seq(.25, 10, by=.25))))


See  Sections  9.8.1  and  9.10  for  more  information. 

The residuals function for lrm and the which.influence function can be used to check predictor transformations as well as to analyze overly influential observations in binary logistic regression. See Figure 10.16 for one application. The residuals function will also perform the unweighted sum of squares test for global goodness of fit described in Section 10.5.

The validate function when used on an object created by lrm does resampling validation of a logistic regression model, with or without backward step-down variable deletion. It provides bias-corrected Somers' $D_{xy}$ rank correlation, $R^2$ index, the intercept and slope of an overall logistic calibration equation, the maximum absolute difference in predicted and calibrated probabilities $E_{max}$, the discrimination index $D$ [(model L.R. $\chi^2 - 1)/n$], the unreliability index $U$ = (difference in −2 log likelihood between uncalibrated $X\hat{\beta}$ and $X\hat{\beta}$ with overall intercept and slope calibrated to test sample)$/n$, and the overall quality index $Q = D - U$ [267]. The “corrected” slope can be thought of as a shrinkage factor that takes overfitting into account. See predab.resample in Section 6.2 for the list of resampling methods.
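The logic of an optimism-corrected index can be sketched with base R alone. The code below is not the rms implementation of validate; it is a minimal illustration on simulated data, assuming the usual bootstrap recipe (fit in each resample, evaluate in the resample and in the original sample, average the difference). The helper names `dxy` and `indexes`, the simulated predictors, and the choice of B = 150 are all assumptions made for the example.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + x1 + 0.5 * x2))   # simulated binary response
d  <- data.frame(y, x1, x2)

# Somers' Dxy from the concordance probability c: Dxy = 2 * (c - 0.5)
dxy <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  cidx <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * (cidx - 0.5)
}

# Dxy plus logistic calibration intercept and slope of a fit evaluated on 'newd'
indexes <- function(fit, newd) {
  lp  <- predict(fit, newdata = newd)              # linear predictor
  cal <- coef(glm(newd$y ~ lp, family = binomial)) # calibration model on the logit
  c(Dxy = dxy(lp, newd$y),
    Intercept = unname(cal[1]), Slope = unname(cal[2]))
}

fit      <- glm(y ~ x1 + x2, family = binomial, data = d)
apparent <- indexes(fit, d)                        # original-sample indexes

B <- 150
optimism <- replicate(B, {
  b  <- d[sample(n, replace = TRUE), ]             # bootstrap resample
  fb <- glm(y ~ x1 + x2, family = binomial, data = b)
  indexes(fb, b) - indexes(fb, d)                  # training minus test
})
corrected <- apparent - rowMeans(optimism)
round(rbind(apparent, optimism = rowMeans(optimism), corrected), 2)
```

The apparent calibration intercept and slope are 0 and 1 by construction, since the model is evaluated on the data used to fit it, which is why the "Original Sample" and "Training Sample" columns of a validation table show 0.00 and 1.00 for those rows.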

The calibrate function produces bootstrapped or cross-validated calibration curves for logistic and linear models. The “apparent” calibration accuracy is estimated using a nonparametric smoother relating predicted probabilities to observed binary outcomes. The nonparametric estimate is evaluated at a sequence of predicted probability levels. Then the distances from the 45° line are compared with the differences when the current model is evaluated back on the whole sample (or omitted sample for cross-validation). The differences in the differences are estimates of overoptimism. After averaging over many replications, the predicted-value-specific differences are then subtracted from the apparent differences and an adjusted calibration curve is obtained. Unlike validate, calibrate does not assume a linear logistic calibration. For an example, see the end of Chapter 11. calibrate will print the mean absolute calibration error, the 0.9 quantile of the absolute error, and the mean squared error, all over the observed distribution of predicted values.

Fig. 10.21 Nomogram relating age, sex, and cholesterol to the log odds and to the probability of significant coronary artery disease. Select one axis corresponding to sex and to age ∈ {30, 40, 50, 60, 70}. There is linear interaction between age and sex and between age and cholesterol. 0.70 and 0.90 confidence intervals are shown (0.90 in gray). Note that for the “Linear Predictor” scale there are various lengths of confidence intervals near the same value of $X\hat{\beta}$, demonstrating that the standard error of $X\hat{\beta}$ depends on the individual X values. Also note that confidence intervals corresponding to smaller patient groups (e.g., females) are wider.
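The "apparent" part of a smooth calibration assessment can be illustrated in base R. This sketch assumes a lowess smoother with no robustness iterations and simulated data; it shows only the apparent curve and the three error summaries, and leaves out the bootstrap correction that calibrate adds.

```r
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))                 # simulated binary outcomes

fit  <- glm(y ~ x, family = binomial)
phat <- fitted(fit)                          # predicted probabilities

# Smooth the 0/1 outcomes against the predictions (apparent calibration curve)
sm  <- lowess(phat, y, iter = 0)
cal <- approx(sm$x, sm$y, xout = phat, ties = mean)$y

# Error summaries over the observed distribution of predictions
err <- abs(phat - cal)
c(mean.abs.error = mean(err),
  q90.abs.error  = unname(quantile(err, 0.9)),
  mse            = mean(err^2))
```

Plotting `sm` against the 45° line would give the apparent curve; repeating the same steps within bootstrap resamples and differencing, as described above, is what removes the overoptimism.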

The val.prob function is used to compute measures of discrimination and calibration of predicted probabilities for a separate sample from the one used to derive the probability estimates. Thus val.prob is used in external validation and data-splitting. The function computes similar indexes as validate plus the Brier score and a statistic for testing for unreliability or $H_0 : \gamma_0 = 0, \gamma_1 = 1$.

In the following example, a logistic model is fitted on 100 observations simulated from the actual model given by

$$\text{Prob}\{Y = 1 | X_1, X_2, X_3\} = [1 + \exp[-(-1 + 2X_1)]]^{-1}, \tag{10.40}$$

where $X_1$ is a random uniform [0, 1] variable. Hence $X_2$ and $X_3$ are irrelevant. After fitting a linear additive model in $X_1$, $X_2$, and $X_3$, the coefficients are used to predict $\text{Prob}\{Y = 1\}$ on a separate sample of 100 observations.

set.seed(13)
n <- 200
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
logit <- 2*(x1-.5)
P <- 1/(1+exp(-logit))
y <- ifelse(runif(n) <= P, 1, 0)
d <- data.frame(x1, x2, x3, y)
f <- lrm(y ~ x1 + x2 + x3, subset=1:100)
phat <- predict(f, d[101:200,], type='fitted')
# Figure 10.22
v <- val.prob(phat, y[101:200], m=20, cex=.5)

The  output  is  shown  in  Figure  10.22. 

The R built-in function glm, a very general modeling function, can fit binary logistic models. The response variable must be coded 0/1 for glm to work. Glm, in the rms package, is a slight modification of the built-in glm function that allows fits to use rms methods. This facilitates Poisson and several other types of regression analysis.
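For comparison with lrm, a minimal base-R sketch of a binary logistic fit with the built-in glm follows. The data are simulated from the same form as Eq. 10.40; the object names are illustrative, and only the built-in function is used (not rms's Glm).

```r
set.seed(13)
n  <- 200
x1 <- runif(n)
y  <- ifelse(runif(n) <= plogis(2 * (x1 - 0.5)), 1, 0)   # response coded 0/1

g <- glm(y ~ x1, family = binomial)   # binomial family gives the logit link by default
coef(g)

p_link <- predict(g, type = "link")       # X beta-hat (log odds)
p_resp <- predict(g, type = "response")   # predicted probabilities
```

The two prediction types are related by the inverse logit, so `plogis(p_link)` reproduces `p_resp`.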
See [590] for modeling strategies specific to binary logistic regression.

See [632] for a nice review of logistic modeling. Agresti [6] is an excellent source for categorical Y in general.

Not only does discriminant analysis assume the same regression model as logistic regression, but it also assumes that the predictors are each normally distributed and that jointly the predictors have a multivariate normal distribution. These assumptions are unlikely to be met in practice, especially when one of the predictors is a discrete variable such as sex group. When discriminant analysis assumptions are violated, logistic regression yields more accurate estimates [251, 514]. Even when discriminant analysis is optimal (i.e., when all its assumptions are satisfied) logistic regression is virtually as accurate as the discriminant model [264].

Fig. 10.22 Validation of a logistic model in a test sample of size n = 100. The calibrated risk distribution (histogram of logistic-calibrated probabilities) is shown.

See  [573]  for  a  review  of  measures  of  effect  for  binary  outcomes. 

Cepeda et al. [95] found that propensity adjustment is better than covariate adjustment with logistic models when the number of events per variable is less than 8.

Pregibon [512] developed a modification of the log likelihood function that when maximized results in a fit that is resistant to overly influential and outlying observations.

See Hosmer and Lemeshow [306] for methods of testing for a difference in the observed event proportion and the predicted event probability (average of predicted probabilities) for a group of heterogeneous subjects.

See Hosmer and Lemeshow [305], Kay and Little [341], and Collett [115, Chap. 5]. Landwehr et al. [373] proposed the partial residual (see also Fowlkes [199]).

See Berk and Booth [51] for other partial-like residuals.

See  [341]  for  an  example  comparing  a  smoothing  method  with  a  parametric 
logistic  model  fit. 

See Collett [115, Chap. 5] and Pregibon [512] for more information about influence statistics. Pregibon's resistant estimator of $\beta$ handles overly influential groups of observations and allows one to estimate the weight that an observation contributed to the fit after making the fit robust. Observations receiving low weight are partially ignored but are not deleted.

Buyse [86] showed that in the case of a single categorical predictor, the ordinary $R^2$ has a ready interpretation in terms of variance explained for binary responses. Menard [454] studied various indexes for binary logistic regression. He criticized $R^2$ for being too dependent on the proportion of observations with $Y = 1$. Hu et al. [309] further studied the properties of variance-based $R^2$ measures for binary responses. Tjur [613] has a nice discussion of discrimination graphics and sum of squares-based $R^2$ measures for binary logistic regression, as well as a good discussion of “separation” and infinite regression coefficients. Sums of squares are approximated in various ways.

Very little work has been done on developing adjusted $R^2$ measures in logistic regression and other non-linear model setups. Liao and McGee [406] developed one adjusted $R^2$ measure for binary logistic regression, but it uses simulation to adjust for the bias of overfitting. One might as well use the bootstrap to adjust any of the indexes discussed in this section.

[123, 633] have more pertinent discussion of probability accuracy scores. Copas [121] demonstrated how ROC areas can be misleading when applied to different responses having greatly different prevalences. He proposed another approach, the logit rank plot. Newson [473] is an excellent reference on $D_{xy}$. Newson [474] developed several generalizations to $D_{xy}$ including a stratified version, and discussed the jackknife variance estimator for them. ROC areas are not very useful for comparing two models [118, 493] (but see [490]).

Gneiting and Raftery [219] have an excellent review of proper scoring rules. Hand [253] contains much information about assessing classification accuracy. Mittlbock and Schemper [461] have an excellent review of indexes of explained variation for binary logistic models. See also Korn and Simon [366] and Zheng and Agresti [684].

Pryor et al. [515] presented nomograms for a 10-variable logistic model. One of the variables was sex, which interacted with some of the other variables. Evaluation of predicted probabilities was simplified by the construction of separate nomograms for females and males. Seven terms for discrete predictors were collapsed into one weighted point score axis in the nomograms, and age by risk factor interactions were captured by having four age scales.

Moons et al. [462] present a case study in penalized binary logistic regression modeling.

The rcspline.plot function in the Hmisc R package does not allow for interactions as does lrm, but it can provide detailed output for checking spline fits. This function plots the estimated spline regression and confidence limits, placing summary statistics on the graph. If there are no adjustment variables, rcspline.plot can also plot two alternative estimates of the regression function: proportions or logit proportions on grouped data, and a nonparametric estimate. The nonparametric regression estimate is based on smoothing the binary responses and taking the logit transformation of the smoothed estimates, if desired. The smoothing uses the “super smoother” of Friedman [207] implemented in the R function supsmu.


10.13  Problems 

1. Consider the age-sex-response example in Section 10.1.3. This dataset is available from the text's web site in the Datasets area.

a. Duplicate the analyses done in Section 10.1.3.

b. For the model containing both age and sex, test $H_0$ : logit response is linear in age versus $H_a$ : logit response is quadratic in age. Use the best test statistic.

c. Using a Wald test, test $H_0$ : no age × sex interaction. Interpret all parameters in the model.

d. Plot the estimated logit response as a function of age and sex, with and without fitting an interaction term.

e. Perform a likelihood ratio test of $H_0$ : the model containing only age and sex is adequate versus $H_a$ : model is inadequate. Here, “inadequate” may mean nonlinearity (quadratic) in age or presence of an interaction.

f. Assuming no interaction is present, test $H_0$ : model is linear in age versus $H_a$ : model is nonlinear in age. Allow “nonlinear” to be more general than quadratic. (Hint: use a restricted cubic spline function with knots at age=39, 45, 55, 64 years.)

g. Plot age against the estimated spline transformation of age (the transformation that would make age fit linearly). You can set the sex and intercept terms to anything you choose. Also plot Prob{response = 1 | age, sex} from this fitted restricted cubic spline logistic model.

2.  Consider  a  binary  logistic  regression  model  using  the  following  predictors: 
age  (years),  sex,  race  (white,  African-American,  Hispanic,  Oriental,  other), 
blood  pressure  (mmHg).  The  fitted  model  is  given  by 

logit  Prob[Y  =  1\X]  =  Xft  =  —1.36  +  .03(race  =  African-American) 

—  .04(race  =  hispanic)  +  .05(race  =  oriental)  —  .06(race  =  other) 

+  .07|blood  pressure  —  1 10 1  +  .3(sex  =  male)  —  .lage  +  .002age2  + 

(sex  =  male)[.05age  —  .003age2]. 

a.  Compute  the  predicted  logit  (log  odds)  that  Y  =  1  for  a  50-year-old 
female  Hispanic  with  a  blood  pressure  of  90  mmHg.  Also  compute  the 
odds  that  Y  =  1  (Prob[Y  =  l]/Prob[Y  =  0])  and  the  estimated  proba¬ 
bility  that  Y  =  1. 

b.  Estimate  odds  ratios  for  each  nonwhite  race  compared  with  the  ref¬ 
erence  group  (white),  holding  all  other  predictors  constant.  Why  can 
you  estimate  the  relative  effect  of  race  for  all  types  of  subjects  without 
specifying  their  characteristics? 

c.  Compute  the  odds  ratio  for  a  blood  pressure  of  120  mmHg  compared 
with  a  blood  pressure  of  105,  holding  age  first  to  30  years  and  then  to 
40  years. 

d.  Compute  the  odds  ratio  for  a  blood  pressure  of  120  mmHg  compared 
with  a  blood  pressure  of  105,  all  other  variables  held  to  unspecified 
constants.  Why  is  this  relative  effect  meaningful  without  knowing  the 
subject’s  age,  race,  or  sex? 

e.  Compute  the  estimated  risk  difference  in  changing  blood  pressure  from 
105  mmHg  to  120  mmHg,  first  for  age  =  30  then  for  age  =  40,  for  a 
white  female.  Why  does  the  risk  difference  depend  on  age? 

f.  Compute  the  relative  odds  for  males  compared  with  females,  for  age  =  50 
and  other  variables  held  constant. 

g.  Same  as  the  previous  question  but  for  females  :  males  instead  of  males 
:  females. 

h.  Compute  the  odds  ratio  resulting  from  increasing  age  from  50  to  55 
for  males,  and  then  for  females,  other  variables  held  constant.  What  is 
wrong with the following question: What is the relative effect of changing
age by one year?
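
The fitted model in this problem can be evaluated mechanically. The following is a minimal base R sketch, not part of the original exercise: the function name `lp` and its argument names are illustrative assumptions, and the coefficients are copied from the equation stated in the problem. It evaluates part (a); the same function can be reused for the odds ratios in the later parts by differencing two calls.

```r
# Linear predictor (log odds) for the fitted model stated in this problem.
# `lp` and its argument names are hypothetical, chosen only for illustration.
lp <- function(age, sex, race, bp) {
  race_coef <- c(white = 0, "African-American" = 0.03, Hispanic = -0.04,
                 Oriental = 0.05, other = -0.06)
  male <- as.numeric(sex == "male")
  -1.36 + unname(race_coef[race]) + 0.07 * abs(bp - 110) + 0.3 * male -
    0.1 * age + 0.002 * age^2 + male * (0.05 * age - 0.003 * age^2)
}

# Part (a): 50-year-old female Hispanic with blood pressure 90 mmHg
logit <- lp(age = 50, sex = "female", race = "Hispanic", bp = 90)
odds  <- exp(logit)      # Prob[Y = 1] / Prob[Y = 0]
prob  <- plogis(logit)   # 1 / (1 + exp(-logit))
```

An odds ratio between two settings is `exp(lp(...) - lp(...))`, which makes it easy to see which predictors cancel out of the difference and which do not.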


Chapter  11 

Case  Study  in  Binary  Logistic  Regression, 
Model  Selection  and  Approximation: 
Predicting  Cause  of  Death 


11.1  Overview 

This  chapter  contains  a  case  study  on  developing,  describing,  and  validating 
a  binary  logistic  regression  model.  In  addition,  the  following  methods  are 
exemplified: 

1. Data reduction using incomplete linear and nonlinear principal components

2.  Use  of  AIC  to  choose  from  five  modeling  variations,  deciding  which  is  best 
for  the  number  of  parameters 

3.  Model  simplification  using  stepwise  variable  selection  and  approximation 
of  the  full  model 

4.  The  relationship  between  the  degree  of  approximation  and  the  degree  of 
predictive  discrimination  loss 

5.  Bootstrap  validation  that  includes  penalization  for  model  uncertainty 
(variable selection) and that demonstrates a loss of predictive discrimination
over the full model even when compensating for overfitting the full
model. 

The  data  reduction  and  pre-transformation  methods  used  here  were  discussed 
in  more  detail  in  Chapter  8.  Single  imputation  will  be  used  because  of  the 
limited  quantity  of  missing  data. 


11.2  Background 

Consider the randomized trial of estrogen for treatment of prostate cancer [87]
described in Chapter 8. In this trial, larger doses of estrogen reduced the effect
of prostate cancer but at the cost of increased risk of cardiovascular death.
Kay [340] did a formal analysis of the competing risks for cancer, cardiovascular,
and other deaths. It can also be quite informative to study how treatment
and baseline variables relate to the cause of death for those patients who
died [376]. We subset the original dataset of those patients dying from prostate
cancer (n = 130), heart or vascular disease (n = 96), or cerebrovascular
disease  (n  =  31).  Our  goal  is  to  predict  cardiovascular-cerebrovascular  death 
(cvd,  n  =  127)  given  the  patient  died  from  either  cvd  or  prostate  cancer.  Of 
interest  is  whether  the  time  to  death  has  an  effect  on  the  cause  of  death,  and 
whether  the  importance  of  certain  variables  depends  on  the  time  of  death. 


11.3  Data  Transformations  and  Single  Imputation 

In  R,  first  obtain  the  desired  subset  of  the  data  and  do  some  preliminary 
calculations  such  as  combining  an  infrequent  category  with  the  next  category, 
and  dichotomizing  ekg  for  use  in  ordinary  principal  components  (PCs). 

require(rms)

getHdata(prostate)
prostate <-
  within(prostate, {
    levels(ekg)[levels(ekg) %in%
                c('old MI', 'recent MI')] <- 'MI'
    ekg.norm <- 1 * (ekg %in% c('normal', 'benign'))
    levels(ekg) <- abbreviate(levels(ekg))
    pfn <- as.numeric(pf)
    levels(pf) <- levels(pf)[c(1, 2, 3, 3)]
    cvd <- status %in% c("dead - heart or vascular",
                         "dead - cerebrovascular")
    rxn = as.numeric(rx) })

# Use transcan to compute optimal pre-transformations
ptrans <-   # See Figure 8.3
  transcan(~ sz + sg + ap + sbp + dbp +
           age + wt + hg + ekg + pf + bm + hx + dtime + rx,
           imputed=TRUE, transformed=TRUE,
           data=prostate, pl=FALSE, pr=FALSE)

# Use transcan single imputations
imp <- impute(ptrans, data=prostate, list.out=TRUE)

Imputed missing values with the following frequencies
 and stored them in variables with their original names:

 sz  sg age  wt ekg
  5  11   1   2   8

NAvars <- all.vars(~ sz + sg + age + wt + ekg)
for(x in NAvars) prostate[[x]] <- imp[[x]]
subset <- prostate$status %in% c("dead - heart or vascular",
    "dead - cerebrovascular", "dead - prostatic ca")
trans <- ptrans$transformed[subset, ]
psub  <- prostate[subset, ]


11.4  Regression  on  Original  Variables,  Principal 
Components  and  Pretransformations 

We  first  examine  the  performance  of  data  reduction  in  predicting  the  cause 
of  death,  similar  to  what  we  did  for  survival  time  in  Section  8.6.  The  first 
analyses  assess  how  well  PCs  (on  raw  and  transformed  variables)  predict  the 
cause  of  death. 

There are 127 cvds. We use the 15:1 rule of thumb discussed on p. 72 to
justify  using  the  first  8  PCs.  ap  is  log-transformed  because  of  its  extreme 
distribution. 
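
The arithmetic behind the choice of 8 components can be sketched as follows. This is an illustrative calculation only; it assumes, as is usual for a binary response, that the limiting sample size is the smaller of the two outcome counts, and the variable names are hypothetical.

```r
n_events    <- 127                 # cvd deaths
n_nonevents <- 130                 # prostate cancer deaths
m <- min(n_events, n_nonevents)    # assumed limiting sample size for a binary outcome
max_params <- floor(m / 15)        # 15:1 rule of thumb: at most m/15 parameters
max_params
```

With 127 in the smaller outcome group, 127/15 is about 8.5, which is the basis for stopping at 8 PCs.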


# Function to compute the first k PCs
ipc <- function(x, k=1, ...)
  princomp(x, ..., cor=TRUE)$scores[, 1:k]

# Compute the first 8 PCs on raw variables then on
# transformed ones
pc8 <- ipc(~ sz + sg + log(ap) + sbp + dbp + age +
           wt + hg + ekg.norm + pfn + bm + hx + rxn + dtime,
           data=psub, k=8)
f8   <- lrm(cvd ~ pc8, data=psub)
pc8t <- ipc(trans, k=8)
f8t  <- lrm(cvd ~ pc8t, data=psub)

# Fit binary logistic model on original variables
f <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age +
         wt + hg + ekg + pf + bm + hx + rx + dtime, data=psub)

# Expand continuous variables using splines
g <- lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +
         rcs(sbp,4) + rcs(dbp,4) + rcs(age,4) + rcs(wt,4) +
         rcs(hg,4) + ekg + pf + bm + hx + rx + rcs(dtime,4),
         data=psub)

# Fit binary logistic model on individual transformed var.
h <- lrm(cvd ~ trans, data=psub)

The five approaches to modeling the outcome are compared using AIC (where
smaller is better).

c(f8=AIC(f8), f8t=AIC(f8t), f=AIC(f), g=AIC(g), h=AIC(h))

      f8      f8t        f        g        h
257.6573 254.5172 255.8545 263.8413 254.5317

Based on AIC, the more traditional model fitted to the raw data and assuming
linearity for all the continuous predictors has only a slight chance
of producing worse cross-validated predictive accuracy than other methods.
The chances are also good that effect estimates from this simple model will
have competitive mean squared errors.
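
To make the comparison concrete, the reported AIC values can be expressed as differences from the smallest one. This is only a sketch using the numbers printed above; the labels follow the fits in the text (f8 and f8t: 8 PCs on raw and transformed variables, f: all-linear model, g: splines, h: individually transformed variables).

```r
aic <- c(f8 = 257.6573, f8t = 254.5172, f = 255.8545,
         g = 263.8413, h = 254.5317)
delta <- sort(aic - min(aic))   # distance of each approach from the best AIC
round(delta, 2)
```

The all-linear model f sits about 1.3 AIC units above the best approach, while the spline model g is more than 9 units above it.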


11.5  Description  of  Fitted  Model 

Here  we  describe  the  simple  all-linear  full  model.  Summary  statistics  and  a 
Wald-ANOVA  table  are  below,  followed  by  partial  effects  plots  with  pointwise 
confidence  bands,  and  odds  ratios  over  default  ranges  of  predictors. 

print(f, latex=TRUE)

Logistic Regression Model

lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +
    hg + ekg + pf + bm + hx + rx + dtime, data = psub)

                         Model Likelihood      Discrimination    Rank Discrim.
                            Ratio Test            Indexes           Indexes
Obs                 257  LR χ²       144.39    R²       0.573    C      0.893
 FALSE              130  d.f.            21    g        2.688    Dxy    0.786
 TRUE               127  Pr(> χ²)  < 0.0001    gr      14.701    γ      0.787
max |∂log L/∂β| 6×10^-11                       gp       0.394    τa     0.395
                                               Brier    0.133

                             Coef     S.E.   Wald Z  Pr(>|Z|)
Intercept                 -4.5130   3.2210    -1.40    0.1612
sz                        -0.0640   0.0168    -3.80    0.0001
sg                        -0.2967   0.1149    -2.58    0.0098
ap                        -0.3927   0.1411    -2.78    0.0054
sbp                       -0.0572   0.0890    -0.64    0.5201
dbp                        0.3917   0.1629     2.40    0.0162
age                        0.0926   0.0286     3.23    0.0012
wt                        -0.0177   0.0140    -1.26    0.2069
hg                         0.0860   0.0925     0.93    0.3524
ekg=bngn                   1.0781   0.8793     1.23    0.2202
ekg=rd&ec                 -0.1929   0.6318    -0.31    0.7601
ekg=hbocd                 -1.3679   0.8279    -1.65    0.0985
ekg=hrts                   0.4365   0.4582     0.95    0.3407
ekg=MI                     0.3039   0.5618     0.54    0.5886
pf=in bed < 50% daytime    0.9604   0.6956     1.38    0.1673
pf=in bed > 50% daytime   -2.3232   1.2464    -1.86    0.0623
bm                         0.1456   0.5067     0.29    0.7738
hx                         1.0913   0.3782     2.89    0.0039
rx=0.2 mg estrogen        -0.3022   0.4908    -0.62    0.5381
rx=1.0 mg estrogen         0.7526   0.5272     1.43    0.1534
rx=5.0 mg estrogen         0.6868   0.5043     1.36    0.1733
dtime                     -0.0136   0.0107    -1.27    0.2040

an <- anova(f)
latex(an, file='', table.env=FALSE)

           χ²   d.f.         P
sz      14.42      1    0.0001
sg       6.67      1    0.0098
ap       7.74      1    0.0054
sbp      0.41      1    0.5201
dbp      5.78      1    0.0162
age     10.45      1    0.0012
wt       1.59      1    0.2069
hg       0.86      1    0.3524
ekg      6.76      5    0.2391
pf       5.52      2    0.0632
bm       0.08      1    0.7738
hx       8.33      1    0.0039
rx       5.72      3    0.1260
dtime    1.61      1    0.2040
TOTAL   66.87     21  < 0.0001


plot(an)   # Figure 11.1
s <- f$stats
gamma.hat <- (s['Model L.R.'] - s['d.f.']) / s['Model L.R.']

dd <- datadist(psub); options(datadist='dd')
ggplot(Predict(f), sepdiscrete='vertical', vnames='names',
       rdata=psub,
       histSpike.opts=list(frac=function(f) .1 * f / max(f)))
# Figure 11.2

plot(summary(f), log=TRUE)   # Figure 11.3

The van Houwelingen-Le Cessie heuristic shrinkage estimate (Equation 4.3)
is γ̂ = 0.85, indicating that this model will validate on new data about 15%
worse than on this dataset.
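
The shrinkage estimate can be reproduced by hand from the printed model statistics. The sketch below plugs the reported likelihood ratio χ² and its degrees of freedom into the same expression used for gamma.hat; the variable names are illustrative.

```r
lr_chisq  <- 144.39   # model likelihood ratio chi-square from the printed fit
p         <- 21       # regression degrees of freedom
gamma_hat <- (lr_chisq - p) / lr_chisq
round(gamma_hat, 2)
```

The 21 d.f. spent on the model use up about 15% of the apparent LR χ², which is where the "15% worse" statement comes from.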

Fig. 11.1 Ranking of apparent importance of predictors of cause of death


11.6  Backwards  Step-Down 


Now  use  fast  backward  step-down  (with  total  residual  AIC  as  the  stopping 
rule)  to  identify  the  variables  that  explain  the  bulk  of  the  cause  of  death. 
Later  validation  will  take  this  screening  of  variables  into  account. The  greatly 
reduced  model  results  in  a  simple  nomogram. 

fastbw(f)

 Deleted Chi-Sq d.f.      P Residual d.f.      P    AIC
 ekg       6.76    5 0.2391     6.76    5 0.2391  -3.24
 bm        0.09    1 0.7639     6.85    6 0.3349  -5.15
 hg        0.38    1 0.5378     7.23    7 0.4053  -6.77
 sbp       0.48    1 0.4881     7.71    8 0.4622  -8.29
 wt        1.11    1 0.2932     8.82    9 0.4544  -9.18
 dtime     1.47    1 0.2253    10.29   10 0.4158  -9.71
 rx        5.65    3 0.1302    15.93   13 0.2528 -10.07
 pf        4.78    2 0.0915    20.71   15 0.1462  -9.29
 sg        4.28    1 0.0385    25.00   16 0.0698  -7.00
 dbp       5.84    1 0.0157    30.83   17 0.0209  -3.17

Approximate Estimates after Deleting Factors

              Coef    S.E. Wald Z         P
Intercept -3.74986 1.82887 -2.050 0.0403286
sz        -0.04862 0.01532 -3.174 0.0015013
ap        -0.40694 0.11117 -3.660 0.0002518
age        0.06000 0.02562  2.342 0.0191701
hx         0.86969 0.34339  2.533 0.0113198

Factors in Final Model

[1] sz  ap  age hx
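
The AIC column of the step-down output equals the cumulative residual χ² minus twice its degrees of freedom, which can be checked directly from the printed values. The sketch below is illustrative only. It assumes that fastbw keeps deleting variables while this quantity does not exceed its aics threshold (taken here to be the default of 0), which is consistent with deletion stopping after dbp.

```r
resid_chisq <- c(ekg = 6.76, bm = 6.85, hg = 7.23, sbp = 7.71, wt = 8.82,
                 dtime = 10.29, rx = 15.93, pf = 20.71, sg = 25.00, dbp = 30.83)
resid_df    <- c(5, 6, 7, 8, 9, 10, 13, 15, 16, 17)

# Residual AIC after each deletion: chi-square penalized by 2 per d.f. removed
resid_aic <- resid_chisq - 2 * resid_df
round(resid_aic, 2)
```

Every value is negative, meaning that at each step the deleted variables jointly explain less than their d.f. penalty.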
Fig. 11.2 Partial effects (log odds scale) in full model for cause of death, along with
vertical line segments showing the raw data distribution of predictors


fred <- lrm(cvd ~ sz + log(ap) + age + hx, data=psub)
latex(fred, file='')

\[
\mathrm{Prob}\{\text{cvd}\} = \frac{1}{1 + \exp(-X\beta)}, \quad \text{where}
\]

\[
X\hat{\beta} = -5.009276 - 0.05510121\,\text{sz} - 0.509185\,\log(\text{ap}) + 0.0788052\,\text{age} + 1.070601\,\text{hx}
\]
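
The fitted equation can be turned into a small prediction function. This is a sketch rather than rms output: `prob_cvd` is a hypothetical name, the coefficients are copied from the equation above, and the patient values are invented purely for illustration.

```r
# Probability of cvd death from the reduced (step-down) model equation
prob_cvd <- function(sz, ap, age, hx) {
  xb <- -5.009276 - 0.05510121 * sz - 0.509185 * log(ap) +
         0.0788052 * age + 1.070601 * hx
  plogis(xb)   # 1 / (1 + exp(-xb))
}

# Hypothetical 70-year-old with tumor size 10 and ap = 1,
# with and without a history of cardiovascular disease
p1 <- prob_cvd(sz = 10, ap = 1, age = 70, hx = 1)
p0 <- prob_cvd(sz = 10, ap = 1, age = 70, hx = 0)
or_hx <- (p1 / (1 - p1)) / (p0 / (1 - p0))   # odds ratio for hx
```

Because the model has no interactions, the hx odds ratio is exp(1.070601) regardless of the other predictor values chosen.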
Fig. 11.3 Interquartile-range odds ratios for continuous predictors and simple odds
ratios for categorical predictors. Numbers at left are upper quartile : lower quartile or
current group : reference group. The bars represent 0.9, 0.95, 0.99 confidence limits.
The intervals are drawn on the log odds ratio scale and labeled on the odds ratio
scale. Ranges are on the original scale.


nom <- nomogram(fred, ap=c(.1, .5, 1, 5, 10, 50),
                fun=plogis, funlabel="Probability",
                fun.at=c(.01, .05, .1, .25, .5, .75, .9, .95, .99))
plot(nom, xfrac=.45)   # Figure 11.4

It is readily seen from this model that patients with a history of heart
disease, and patients with less extensive prostate cancer are those more likely
to die from cvd rather than from cancer. But beware that it is easy to overinterpret
findings when using unpenalized estimation, and confidence intervals
are too narrow. Let us use the bootstrap to study the uncertainty in
the selection of variables and to penalize for this uncertainty when estimating
predictive performance of the model. The variables selected in the first 20
bootstrap resamples are shown, making it obvious that the set of “significant”
variables, i.e., the final model, is somewhat arbitrary.

Fig. 11.4 Nomogram calculating Xβ̂ and P̂ for cvd as the cause of death, using
the step-down model. For each predictor, read the points assigned on the 0-100 scale
and add these points. Read the result on the Total Points scale and then read the
corresponding predictions below it.


Index      Original  Training    Test  Optimism  Corrected    n
             Sample    Sample  Sample                Index
Dxy           0.682     0.713   0.643     0.071      0.611  200
R2            0.439     0.481   0.393     0.088      0.351  200
Intercept     0.000     0.000  -0.006     0.006     -0.006  200
Slope         1.000     1.000   0.811     0.189      0.811  200
Emax          0.000     0.000   0.048     0.048      0.048  200
D             0.395     0.449   0.346     0.102      0.293  200
U            -0.008    -0.008   0.018    -0.026      0.018  200
Q             0.403     0.456   0.329     0.128      0.275  200
B             0.162     0.151   0.174    -0.022      0.184  200
g             1.932     2.213   1.756     0.457      1.475  200
gp            0.341     0.355   0.320     0.035      0.306  200

Factors Retained in Backwards Elimination
First 20 Resamples

sz  sg  ap  sbp  dbp  age  wt  hg  ekg  pf  bm  hx  rx  dtime


Frequencies of Numbers of Factors Retained

 1  2  3  4  5  6  7  8  9 11 12
 6 39 47 61 19 10  8  4  2  3  1


The slope shrinkage (γ̂) is a bit lower than was estimated above. There is
drop-off  in  all  indexes.  The  estimated  likely  future  predictive  discrimination 
of  the  model  as  measured  by  Somers’  Dxy  fell  from  0.682  to  0.611.  The 
latter  estimate  is  the  one  that  should  be  claimed  when  describing  model 
performance. 
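
The optimism correction is a simple subtraction, which can be verified against the validation table above. The sketch below uses three rows of that table; the conversion from Dxy to the C index relies on the identity Dxy = 2(C - 0.5), which agrees with the C and Dxy values printed for the full model.

```r
original <- c(Dxy = 0.682, R2 = 0.439, Slope = 1.000)   # apparent indexes
optimism <- c(Dxy = 0.071, R2 = 0.088, Slope = 0.189)   # mean(training - test) over resamples
corrected <- original - optimism                         # overfitting-corrected indexes

# Somers' Dxy re-expressed as a concordance probability
c_index <- corrected[["Dxy"]] / 2 + 0.5
```

The corrected Dxy of 0.611 corresponds to a C index of about 0.81.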

A  nearly  unbiased  estimate  of  future  calibration  of  the  stepwise-derived 
model  is  given  below. 

cal <- calibrate(f, B=200, bw=TRUE)
plot(cal)   # Figure 11.5

The  amount  of  overfitting  seen  in  Figure  11.5  is  consistent  with  the  indexes 
produced  by  the  validate  function. 

For  comparison,  consider  a  bootstrap  validation  of  the  full  model  without 
using  variable  selection. 

vfull <- validate(f, B=200)
latex(vfull, digits=3)




Fig. 11.5 Bootstrap overfitting-corrected calibration curve estimate for the backwards
step-down cause of death logistic model, along with a rug plot showing the
distribution of predicted risks. The smooth nonparametric calibration estimator (loess)
is used. Plot annotation: B = 200 repetitions, boot; mean absolute error = 0.028; n = 257.


Index      Original  Training    Test  Optimism  Corrected    n
             Sample    Sample  Sample                Index
Dxy           0.786     0.833   0.738     0.095      0.691  200
R2            0.573     0.641   0.501     0.140      0.433  200
Intercept     0.000     0.000  -0.013     0.013     -0.013  200
Slope         1.000     1.000   0.690     0.310      0.690  200
Emax          0.000     0.000   0.085     0.085      0.085  200
D             0.558     0.653   0.468     0.185      0.373  200
U            -0.008    -0.008   0.051    -0.058      0.051  200
Q             0.566     0.661   0.417     0.244      0.322  200
B             0.133     0.115   0.150    -0.035      0.168  200
g             2.688     3.464   2.355     1.108      1.579  200
gp            0.394     0.416   0.366     0.050      0.344  200

Compared  to  the  validation  of  the  full  model,  the  step-down  model  has  less 
optimism,  but  it  started  with  a  smaller  Dxy  due  to  loss  of  information  from 
removing  moderately  important  variables.  The  improvement  in  optimism  was 
not  enough  to  offset  the  effect  of  eliminating  variables.  If  shrinkage  were  used 
with  the  full  model,  it  would  have  better  calibration  and  discrimination  than 
the  reduced  model,  since  shrinkage  does  not  diminish  Dxy.  Thus  stepwise 
variable  selection  failed  at  delivering  excellent  predictive  discrimination. 

Finally, compare previous results with a bootstrap validation of a step-down
model using a better significance level for a variable to stay in the
model (α = 0.5 [589]) and using individual approximate Wald tests rather
than tests combining all deleted variables.


v5 <- validate(f, bw=TRUE, sls=0.5, type='individual', B=200)

Backwards Step-down - Original Model

 Deleted Chi-Sq d.f.      P Residual d.f.      P    AIC
 ekg       6.76    5 0.2391     6.76    5 0.2391  -3.24
 bm        0.09    1 0.7639     6.85    6 0.3349  -5.15
 hg        0.38    1 0.5378     7.23    7 0.4053  -6.77
 sbp       0.48    1 0.4881     7.71    8 0.4622  -8.29
 wt        1.11    1 0.2932     8.82    9 0.4544  -9.18
 dtime     1.47    1 0.2253    10.29   10 0.4158  -9.71
 rx        5.65    3 0.1302    15.93   13 0.2528 -10.07

Approximate Estimates after Deleting Factors

                            Coef    S.E. Wald Z        P
Intercept               -4.86308 2.67292 -1.819 0.068852
sz                      -0.05063 0.01581 -3.202 0.001366
sg                      -0.28038 0.11014 -2.546 0.010903
ap                      -0.24838 0.12369 -2.008 0.044629
dbp                      0.28288 0.13036  2.170 0.030008
age                      0.08502 0.02690  3.161 0.001572
pf=in bed < 50% daytime  0.81151 0.66376  1.223 0.221485
pf=in bed > 50% daytime -2.19885 1.21212 -1.814 0.069670
hx                       0.87834 0.35203  2.495 0.012592

Factors in Final Model

[1] sz  sg  ap  dbp age pf  hx

latex(v5, digits=3, B=0)


Index      Original  Training    Test  Optimism  Corrected    n
             Sample    Sample  Sample                Index
Dxy           0.739     0.801   0.716     0.085      0.654  200
R2            0.517     0.598   0.481     0.117      0.400  200
Intercept     0.000     0.000  -0.008     0.008     -0.008  200
Slope         1.000     1.000   0.745     0.255      0.745  200
Emax          0.000     0.000   0.067     0.067      0.067  200
D             0.486     0.593   0.444     0.149      0.337  200
U            -0.008    -0.008   0.033    -0.040      0.033  200
Q             0.494     0.601   0.411     0.190      0.304  200
B             0.147     0.125   0.156    -0.030      0.177  200
g             2.351     2.958   2.175     0.784      1.567  200
gp            0.372     0.401   0.358     0.043      0.330  200

The  performance  statistics  are  midway  between  the  full  model  and  the 
smaller  stepwise  model. 
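
The three bootstrap validations can be placed side by side. The sketch below only collects the Dxy and slope rows already reported in the three validation tables; the data frame and its column names are illustrative.

```r
val <- data.frame(
  model    = c("full", "step-down (AIC)", "step-down (sls = 0.5)"),
  dxy_app  = c(0.786, 0.682, 0.739),   # apparent Dxy (original sample)
  dxy_corr = c(0.691, 0.611, 0.654),   # optimism-corrected Dxy
  slope    = c(0.690, 0.811, 0.745)    # corrected calibration slope
)
val$optimism <- val$dxy_app - val$dxy_corr
val[order(-val$dxy_corr), ]            # rank by corrected discrimination
```

The full model has the largest optimism but still the best corrected Dxy, and the more liberal step-down model falls between the other two on both discrimination and slope.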


11.7 Model Approximation

Frequently a better approach than stepwise variable selection is to approximate
the full model, using its estimates of precision, as discussed in Section 5.5.
Stepwise variable selection as well as regression trees are useful for
making the approximations, and the sacrifice in predictive accuracy is always
apparent.

We begin by computing the “gold standard” linear predictor from the full
model fit (R² = 1.0), then running backwards step-down OLS regression to
approximate it.


lp <- predict(f)   # Compute linear predictor from full model
# Insert sigma=1 as otherwise sigma=0 will cause problems
a <- ols(lp ~ sz + sg + log(ap) + sbp + dbp + age + wt +
         hg + ekg + pf + bm + hx + rx + dtime, sigma=1,
         data=psub)
# Specify silly stopping criterion to remove all variables
s <- fastbw(a, aics=10000)
betas <- s$Coefficients   # matrix, rows=iterations
X     <- cbind(1, f$x)    # design matrix
# Compute the series of approximations to lp
ap <- X %*% t(betas)
# For each approx. compute approximation R^2 and ratio of
# likelihood ratio chi-square for approximate model to that
# of original model
m <- ncol(ap) - 1   # all but intercept-only model
r2 <- frac <- numeric(m)
fullchisq <- f$stats['Model L.R.']
for(i in 1:m) {
  lpa <- ap[, i]
  r2[i] <- cor(lpa, lp)^2
  fapprox <- lrm(cvd ~ lpa, data=psub)
  frac[i] <- fapprox$stats['Model L.R.'] / fullchisq
}
# Figure 11.6:
plot(r2, frac, type='b',
     xlab=expression(paste('Approximation ', R^2)),
     ylab=expression(paste('Fraction of ',
         chi^2, ' Preserved')))
abline(h=.95, col=gray(.83)); abline(v=.95, col=gray(.83))
abline(a=0, b=1, col=gray(.83))

After 6 deletions, slightly more than 0.05 of both the LR χ² and the approximation
R² are lost (see Figure 11.6). Therefore we take as our approximate
model the one that removed 6 predictors. The equation for this model is
below, and its nomogram is in Figure 11.7.
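
The idea of approximating a full model's linear predictor can be illustrated on simulated data with base R alone. This is a sketch under stated assumptions, not the analysis in the text: it substitutes glm and lm for lrm and ols, uses invented predictors x1 to x3, and drops a single weak predictor rather than running a full backward step-down.

```r
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.0 * x1 + 0.5 * x2 + 0.1 * x3))   # simulated outcome

full <- glm(y ~ x1 + x2 + x3, family = binomial)
lp   <- predict(full)                       # "gold standard" linear predictor
lr   <- function(fit) fit$null.deviance - fit$deviance   # model LR chi-square

# Approximate lp by least squares on a subset of the predictors
approx_fit <- lm(lp ~ x1 + x2)
lpa  <- fitted(approx_fit)
r2   <- cor(lpa, lp)^2                      # approximation R^2
frac <- lr(glm(y ~ lpa, family = binomial)) / lr(full)   # fraction of LR chi-square preserved
c(r2 = r2, frac = frac)
```

Regressing lp on all three predictors would reproduce it exactly (R² = 1), so r2 and frac measure what is given up by the omission.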

fapprox <- ols(lp ~ sz + sg + log(ap) + age + ekg + pf + hx +
               rx, data=psub)
fapprox$stats['R2']   # as a check

       R2
0.9453396

latex(fapprox, file='')

Fig. 11.6 Fraction of explainable variation (full model LR χ²) in cvd that was
explained by approximate models, along with approximation accuracy (x-axis)


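\[
E(\text{lp}) = X\beta, \quad \text{where}
\]

\[
\begin{aligned}
X\hat{\beta} = {} & -2.868303 - 0.06233241\,\text{sz} - 0.3157901\,\text{sg} - 0.3834479\,\log(\text{ap}) + 0.09089393\,\text{age} \\
& + 1.396922\,[\text{bngn}] + 0.06275034\,[\text{rd\&ec}] - 1.24892\,[\text{hbocd}] + 0.6511938\,[\text{hrts}] + 0.3236771\,[\text{MI}] \\
& + 1.116028\,[\text{in bed} < 50\%\ \text{daytime}] - 2.436734\,[\text{in bed} > 50\%\ \text{daytime}] \\
& + 1.05316\,\text{hx} \\
& - 0.3888534\,[\text{0.2 mg estrogen}] + 0.6920495\,[\text{1.0 mg estrogen}] + 0.7834498\,[\text{5.0 mg estrogen}]
\end{aligned}
\]

and [c] = 1 if subject is in group c, 0 otherwise.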


nom <- nomogram(fapprox, ap=c(.1, .5, 1, 5, 10, 20, 30, 40),
                fun=plogis, funlabel="Probability",
                lp.at=(-5):4,
                fun.lp.at=qlogis(c(.01, .05, .25, .5, .75, .95, .99)))
plot(nom, xfrac=.45)   # Figure 11.7

Fig. 11.7 Nomogram for predicting the probability of cvd based on the approximate
model


Chapter  12 

Logistic  Model  Case  Study  2:  Survival 
of  Titanic  Passengers 


This case study demonstrates the development of a binary logistic regression
model to describe patterns of survival in passengers on the Titanic, based on
passenger age, sex, ticket class, and the number of family members accompanying
each passenger. Nonparametric regression is also used. Since many
of the passengers had missing ages, multiple imputation is used so that the
complete information on the other variables can be efficiently utilized. Titanic
passenger data were gathered by many researchers. Primary references are
the Encyclopedia Titanica at www.encyclopedia-titanica.org and Eaton and
Haas [169]. Titanic survival patterns have been analyzed previously [151, 296, 571]
but without incorporation of individual passenger ages. Thomas Cason, while
a University of Virginia student, compiled and interpreted the data from the
World Wide Web. One thousand three hundred nine of the passengers are
represented in the dataset, which is available from this text’s Web site under
the  name  titanic3.  An  early  analysis  of  Titanic  data  may  be  found  in  Bron  °. 


12.1  Descriptive  Statistics 

First we obtain basic descriptive statistics on key variables.

require(rms)

getHdata(titanic3)   # get dataset from web site
# List of names of variables to analyze
v <- c('pclass', 'survived', 'age', 'sex', 'sibsp', 'parch')
t3 <- titanic3[, v]
units(t3$age) <- 'years'
latex(describe(t3), file='')

t3

6 Variables   1309 Observations

pclass
      n  missing  unique
   1309        0       3

1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)

survived : Survived
      n  missing  unique  Info  Sum   Mean
   1309        0       2  0.71  500  0.382

age : Age [years]
      n  missing  unique  Info   Mean  .05  .10  .25  .50  .75  .90  .95
   1046      263      98     1  29.88    5   14   21   28   39   50   57

lowest :  0.1667  0.3333  0.4167  0.6667  0.7500
highest: 70.5000 71.0000 74.0000 76.0000 80.0000

sex
      n  missing  unique
   1309        0       2

female (466, 36%), male (843, 64%)

sibsp : Number of Siblings/Spouses Aboard
      n  missing  unique  Info    Mean
   1309        0       7  0.67  0.4989

             0    1   2   3   4  5  8
Frequency  891  319  42  20  22  6  9
%           68   24   3   2   2  0  1

parch : Number of Parents/Children Aboard
      n  missing  unique  Info   Mean
   1309        0       8  0.55  0.385

              0    1    2  3  4  5  6  9
Frequency  1002  170  113  8  6  6  2  2
%            77   13    9  1  0  0  0  0



Next,  we  obtain  access  to  the  needed  variables  and  observations,  and  save  data 
distribution  characteristics  for  plotting  and  for  computing  predictor  effects. 
There  are  not  many  passengers  having  more  than  3  siblings  or  spouses  or 
more  than  3  children,  so  we  truncate  two  variables  at  3  for  the  purpose  of 
estimating  stratified  survival  probabilities. 
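
The effect of truncating at 3 can be previewed from the sibsp frequencies in the descriptive statistics. The sketch below is illustrative: it rebuilds the sibsp values from the printed frequency table and uses base R pmin as a stand-in for the grouping, on the assumption that cut2(sibsp, 0:3) pools all values of 3 or more into one top interval.

```r
# sibsp frequencies as printed by describe(): value = count
sibsp_freq <- c("0" = 891, "1" = 319, "2" = 42, "3" = 20,
                "4" = 22, "5" = 6, "8" = 9)
sibsp <- rep(as.integer(names(sibsp_freq)), times = sibsp_freq)

sibsp_trunc <- pmin(sibsp, 3)   # values above 3 are pooled with 3
table(sibsp_trunc)
```

Only 37 of the 1309 passengers have more than 3 siblings or spouses aboard, so pooling them costs little information while stabilizing the stratified estimates.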


dd <- datadist(t3)
# describe distributions of variables to rms
options(datadist='dd')
s <- summary(survived ~ age + sex + pclass +
             cut2(sibsp, 0:3) + cut2(parch, 0:3), data=t3)
plot(s, main='', subtitles=FALSE)   # Figure 12.1

Note the large number of missing ages. Also note the strong effects of sex and
passenger class on the probability of surviving. The age effect does not appear
to be very strong because, as we show later, much of the effect is restricted to
age < 21 years for one of the sexes. The effects of the last two variables are
unclear as the estimated proportions are not monotonic in the values of these
descriptors. Although some of the cell sizes are small, we can show four-way
empirical relationships with the fraction of surviving passengers by creating
four cells for sibsp × parch combinations and by creating two age groups. We
suppress proportions based on fewer than 25 passengers in a cell. Results are
shown in Figure 12.2.

Fig. 12.1 Univariable summaries of Titanic survival

[...] / child', 'par/child'))
g <- function(y) if(length(y) < 25) NA else mean(y)
s <- with(tn, summari[...]
# [...] Hmisc package
# Figure 12.2:
ggplot(subset(s, agec != 'NA'),
       aes(x=survived, y=pclass, shape=sex)) +
  geom_point() + facet_grid(agec ~ sibsp * parch) +
  xlab('Proportion Surviving') + ylab('Passenger Class') +
  scale_x_continuous(breaks=c(0, .5, 1))

Fig. 12.2 Multi-way summary of Titanic survival

Note that none of the effects of sibsp or parch for common passenger groups
appear strong on an absolute risk scale.


12.2  Exploring  Trends  with  Nonparametric  Regression 

As  described  in  Section  2.4.7,  the  loess  smoother  has  excellent  performance 
when  the  response  is  binary,  as  long  as  outlier  detection  is  turned  off.  Here 
we  use  a  ggplot2  add-on  function  histSpikeg  in  the  Hmisc  package  to  obtain 
and  plot  the  loess  fit  and  age  distribution.  histSpikeg  uses  the  “no  iteration” 
option  for  the  R  lowess  function  when  the  response  is  binary. 
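
The "no iteration" setting can be illustrated with the base R lowess function on simulated data. This is a sketch only: the data are simulated rather than taken from titanic3, and the age effect used to generate them is an arbitrary assumption.

```r
set.seed(2)
age      <- runif(500, 0, 80)
survived <- rbinom(500, 1, plogis(0.5 - 0.03 * age))   # simulated binary response

# iter = 0 turns off the robustness iterations that would otherwise
# downweight points with large residuals
fit <- lowess(age, survived, iter = 0)

plot(age, survived, pch = "|", col = "gray")
lines(fit)   # smoothed estimate of Prob(survived = 1) as a function of age
```

With a 0/1 response every observation looks like an outlier relative to a smooth probability curve, which is why the downweighting step has to be disabled for the smoother to estimate proportions sensibly.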


# Figure 12.3
b  <- scale_size_discrete(range=c(.1, .85))
yl <- ylab(NULL)
p1 <- ggplot(t3, aes(x=age, y=survived)) +
      histSpikeg(survived ~ age, lowess=TRUE, data=t3) +
      ylim(0, 1) + yl
p2 <- ggplot(t3, aes(x=age, y=survived, color=sex)) +
      histSpikeg(survived ~ age + sex, lowess=TRUE,
                 data=t3) + ylim(0, 1) + yl
p3 <- ggplot(t3, aes(x=age, y=survived, size=pclass)) +
      histSpikeg(survived ~ age + pclass, lowess=TRUE,
                 data=t3) + b + ylim(0, 1) + yl
p4 <- ggplot(t3, aes(x=age, y=survived, color=sex,
                     size=pclass)) +
      histSpikeg(survived ~ age + sex + pclass,
                 lowess=TRUE, data=t3) +
      b + ylim(0, 1) + yl
gridExtra::grid.arrange(p1, p2, p3, p4, ncol=2)   # combine the 4 plots
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age 


[Figure 12.3: four panels; x-axis: age (0 to 80); legends: sex (female, male) and pclass (1st, 2nd, 3rd)]


Fig.  12.3  Nonparametric  regression  (loess)  estimates  of  the  relationship  between 
age  and  the  probability  of  surviving  the  Titanic,  with  tick  marks  depicting  the  age 
distribution.  The  top  left  panel  shows  unstratified  estimates  of  the  probability  of 
survival. Other panels show nonparametric estimates by various stratifications.


Figure  12.3  shows  much  of  the  story  of  passenger  survival  patterns.  “Women 
and  children  first”  seems  to  be  true  except  for  women  in  third  class.  It  is 
interesting  that  there  is  no  real  cutoff  for  who  is  considered  a  child.  For  men, 
the  younger  the  greater  chance  of  surviving.  The  interpretation  of  the  effects 
of the “number of relatives”-type variables will be more difficult, as their
definitions  are  a  function  of  age.  Figure  12.4  shows  these  relationships. 

# Figure 12.4
top <- theme(legend.position='top')
p1 <- ggplot(t3, aes(x=age, y=survived, color=cut2(sibsp,
        0:2))) + stat_plsmo() + b + ylim(0,1) + yl + top +
      scale_color_discrete(name='siblings/spouses')
p2 <- ggplot(t3, aes(x=age, y=survived, color=cut2(parch,
        0:2))) + stat_plsmo() + b + ylim(0,1) + yl + top +
      scale_color_discrete(name='parents/children')
gridExtra::grid.arrange(p1, p2, ncol=2)


[Figure 12.4: two panels; x-axis: age (0 to 80); legends: siblings/spouses (0, 1, [2,8]) and parents/children (0, 1, [2,9])]

Fig.  12.4  Relationship  between  age  and  survival  stratified  by  the  number  of  siblings 
or  spouses  on  board  (left  panel)  or  by  the  number  of  parents  or  children  of  the 
passenger  on  board  (right  panel). 


12.3  Binary Logistic Model With Casewise Deletion of Missing Values

What follows is the standard analysis based on eliminating observations having
any missing data. We develop an initial somewhat saturated logistic
model, allowing for a flexible nonlinear age effect that can differ in shape
for all six sex × class strata. The sibsp and parch variables do not have
sufficiently dispersed distributions to allow us to model them nonlinearly.
Also, there are too few passengers with nonzero values of these two variables
in sex × pclass × age strata to allow us to model complex interactions
involving them. The meaning of these variables does depend on the passenger's
age, so we consider only age interactions involving sibsp and parch.

f1 <- lrm(survived ~ sex*pclass*rcs(age,5) +
          rcs(age,5)*(sibsp + parch), data=t3)   # Table 12.1
latex(anova(f1), file='', label='titanic-anova3',
      size='small')

Three-way  interactions  are  clearly  insignificant  (P  =  0.4)  in  Table  12.1.  So 
is  parch  (P  =  0.6  for  testing  the  combined  main  effect  +  interaction  effects 
for  parch,  i.e.,  whether  parch  is  important  for  any  age).  These  effects  would 
be  deleted  in  almost  all  bootstrap  resamples  had  we  bootstrapped  a  variable 
selection procedure using α = 0.1 for retention of terms, so we can safely
ignore  these  terms  for  future  steps.  The  model  not  containing  those  terms 
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Table  12.1  Wald  Statistics  for  survived 


                                                      χ²    d.f.  P
sex (Factor+Higher Order Factors)                   187.15   15   < 0.0001
  All Interactions                                   59.74   14   < 0.0001
pclass (Factor+Higher Order Factors)                100.10   20   < 0.0001
  All Interactions                                   46.51   18   0.0003
age (Factor+Higher Order Factors)                    56.20   32   0.0052
  All Interactions                                   34.57   28   0.1826
  Nonlinear (Factor+Higher Order Factors)            28.66   24   0.2331
sibsp (Factor+Higher Order Factors)                  19.67    5   0.0014
  All Interactions                                   12.13    4   0.0164
parch (Factor+Higher Order Factors)                   3.51    5   0.6217
  All Interactions                                    3.51    4   0.4761
sex × pclass (Factor+Higher Order Factors)           42.43   10   < 0.0001
sex × age (Factor+Higher Order Factors)              15.89   12   0.1962
  Nonlinear (Factor+Higher Order Factors)            14.47    9   0.1066
  Nonlinear Interaction : f(A,B) vs. AB               4.17    3   0.2441
pclass × age (Factor+Higher Order Factors)           13.47   16   0.6385
  Nonlinear (Factor+Higher Order Factors)            12.92   12   0.3749
  Nonlinear Interaction : f(A,B) vs. AB               6.88    6   0.3324
age × sibsp (Factor+Higher Order Factors)            12.13    4   0.0164
  Nonlinear                                           1.76    3   0.6235
  Nonlinear Interaction : f(A,B) vs. AB               1.76    3   0.6235
age × parch (Factor+Higher Order Factors)             3.51    4   0.4761
  Nonlinear                                           1.80    3   0.6147
  Nonlinear Interaction : f(A,B) vs. AB               1.80    3   0.6147
sex × pclass × age (Factor+Higher Order Factors)      8.34    8   0.4006
  Nonlinear                                           7.74    6   0.2581
TOTAL NONLINEAR                                      28.66   24   0.2331
TOTAL INTERACTION                                    75.61   30   < 0.0001
TOTAL NONLINEAR + INTERACTION                        79.49   33   < 0.0001
TOTAL                                               241.93   39   < 0.0001

is fitted below. The ^2 in the model formula means to expand the terms in
parentheses  to  include  all  main  effects  and  second-order  interactions. 
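As a side illustration of this formula expansion, base R's terms function lists the terms a formula generates. This is a minimal sketch; the names a, b, and c are hypothetical placeholders standing in for sex, pclass, and the age spline.

```r
# Placeholder names a, b, c stand in for sex, pclass, and rcs(age, 5)
tt <- terms(y ~ (a + b + c)^2)

# Main effects first, then every two-way interaction; no three-way term
labs <- attr(tt, "term.labels")
print(labs)
```

Because the exponent is 2, the three-way term a:b:c is not generated, which is how the three-way interaction is dropped from the model.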

f <- lrm(survived ~ (sex + pclass + rcs(age,5))^2 +
         rcs(age,5)*sibsp, data=t3)
print(f, latex=TRUE)


Logistic  Regression  Model 

lrm(formula = survived ~ (sex + pclass + rcs(age, 5))^2
    + rcs(age, 5) * sibsp, data = t3)


Frequencies of Missing Values Due to Each Variable

survived   sex   pclass   age   sibsp
       0     0        0   263       0


                           Model Likelihood     Discrimination    Rank Discrim.
                              Ratio Test           Indexes           Indexes
Obs                 1046   LR χ²      553.87    R²       0.555    C      0.878
 0                   619   d.f.           26    g        2.427    Dxy    0.756
 1                   427   Pr(> χ²) < 0.0001    gr      11.325    γ      0.758
max |∂log L/∂β|   6×10⁻⁶                        gp       0.365    τa     0.366
                                                Brier    0.130

                           Coef      S.E.    Wald Z   Pr(>|Z|)
Intercept                 3.3075    1.8427    1.79    0.0727
sex=male                 -1.1478    1.0878   -1.06    0.2914
pclass=2nd                6.7309    3.9617    1.70    0.0893
pclass=3rd               -1.6437    1.8299   -0.90    0.3691
age                       0.0886    0.1346    0.66    0.5102
age'                     -0.7410    0.6513   -1.14    0.2552
age''                     4.9264    4.0047    1.23    0.2186
age'''                   -6.6129    5.4100   -1.22    0.2216
sibsp                    -1.0446    0.3441   -3.04    0.0024
sex=male * pclass=2nd    -0.7682    0.7083   -1.08    0.2781
sex=male * pclass=3rd     2.1520    0.6214    3.46    0.0005
sex=male * age           -0.2191    0.0722   -3.04    0.0024
sex=male * age'           1.0842    0.3886    2.79    0.0053
sex=male * age''         -6.5578    2.6511   -2.47    0.0134
sex=male * age'''         8.3716    3.8532    2.17    0.0298
pclass=2nd * age         -0.5446    0.2653   -2.05    0.0401
pclass=3rd * age         -0.1634    0.1308   -1.25    0.2118
pclass=2nd * age'         1.9156    1.0189    1.88    0.0601
pclass=3rd * age'         0.8205    0.6091    1.35    0.1780
pclass=2nd * age''       -8.9545    5.5027   -1.63    0.1037
pclass=3rd * age''       -5.4276    3.6475   -1.49    0.1367
pclass=2nd * age'''       9.3926    6.9559    1.35    0.1769
pclass=3rd * age'''       7.5403    4.8519    1.55    0.1202
age * sibsp               0.0357    0.0340    1.05    0.2933
age' * sibsp             -0.0467    0.2213   -0.21    0.8330
age'' * sibsp             0.5574    1.6680    0.33    0.7382
age''' * sibsp           -1.1937    2.5711   -0.46    0.6425

latex(anova(f), file='', label='titanic-anova2', size='small')
# Table 12.2

This  is  a  very  powerful  model  (ROC  area  =  c  =  0.88);  the  survival  patterns 
are  easy  to  detect.  The  Wald  ANOVA  in  Table  12.2  indicates  especially  strong 
sex and pclass effects (χ² = 199 and 109, respectively). There is a very strong
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Table  12.2  Wald  Statistics  for  survived 


                                                      χ²    d.f.  P
sex (Factor+Higher Order Factors)                   199.42    7   < 0.0001
  All Interactions                                   56.14    6   < 0.0001
pclass (Factor+Higher Order Factors)                108.73   12   < 0.0001
  All Interactions                                   42.83   10   < 0.0001
age (Factor+Higher Order Factors)                    47.04   20   0.0006
  All Interactions                                   24.51   16   0.0789
  Nonlinear (Factor+Higher Order Factors)            22.72   15   0.0902
sibsp (Factor+Higher Order Factors)                  19.95    5   0.0013
  All Interactions                                   10.99    4   0.0267
sex × pclass (Factor+Higher Order Factors)           35.40    2   < 0.0001
sex × age (Factor+Higher Order Factors)              10.08    4   0.0391
  Nonlinear                                           8.17    3   0.0426
  Nonlinear Interaction : f(A,B) vs. AB               8.17    3   0.0426
pclass × age (Factor+Higher Order Factors)            6.86    8   0.5516
  Nonlinear                                           6.11    6   0.4113
  Nonlinear Interaction : f(A,B) vs. AB               6.11    6   0.4113
age × sibsp (Factor+Higher Order Factors)            10.99    4   0.0267
  Nonlinear                                           1.81    3   0.6134
  Nonlinear Interaction : f(A,B) vs. AB               1.81    3   0.6134
TOTAL NONLINEAR                                      22.72   15   0.0902
TOTAL INTERACTION                                    67.58   18   < 0.0001
TOTAL NONLINEAR + INTERACTION                        70.68   21   < 0.0001
TOTAL                                               253.18   26   < 0.0001

sex  x  pclass  interaction  and  a  strong  age  x  sibsp  interaction,  considering 
the  strength  of  sibsp  overall. 

Let  us  examine  the  shapes  of  predictor  effects.  With  so  many  interactions 
in  the  model  we  need  to  obtain  predicted  values  at  least  for  all  combinations 
of  sex  and  pclass.  For  sibsp  we  consider  only  two  of  its  possible  values. 

p <- Predict(f, age, sex, pclass, sibsp=0, fun=plogis)
ggplot(p)   # Fig. 12.5


Note the agreement between the lower right-hand panel of Figure 12.3 and
Figure  12.5.  This  results  from  our  use  of  similar  flexibility  in  the  parametric 
and  nonparametric  approaches  (and  similar  effective  degrees  of  freedom).  The 
estimated  effect  of  sibsp  as  a  function  of  age  is  shown  in  Figure  12.6. 


ggplot(Predict(f, sibsp, age=c(10,15,20,50), conf.int=FALSE))
## Figure 12.6


Note that children having many siblings apparently had lower survival.
Married adults had slightly higher survival than unmarried ones.

There  will  never  be  another  Titanic,  so  we  do  not  need  to  validate  the 
model  for  prospective  use.  But  we  use  the  bootstrap  to  validate  the  model 
anyway,  in  an  effort  to  detect  whether  it  is  overfitting  the  data.  We  do  not 
penalize the calculations that follow for having examined the effect of parch or
for testing three-way interactions, in the belief that these tests would replicate
well.

[Figure 12.5: x-axis: Age, years (0 to 60 in each panel); legend: sex (female, male)]

Fig. 12.5 Effects of predictors on probability of survival of Titanic passengers,
estimated for zero siblings or spouses

Fig. 12.6 Effect of number of siblings and spouses on the log odds of surviving, for
third class males

f <- update(f, x=TRUE, y=TRUE)
# x=TRUE, y=TRUE adds raw data to fit object so can bootstrap
set.seed(131)   # so can replicate re-samples
latex(validate(f, B=200), digits=2, size='Ssize')

Index       Original   Training   Test     Optimism   Corrected    n
            Sample     Sample     Sample              Index
Dxy           0.76       0.77      0.74      0.03       0.72      200
R²            0.55       0.58      0.53      0.05       0.50      200
Intercept     0.00       0.00     -0.08      0.08      -0.08      200
Slope         1.00       1.00      0.87      0.13       0.87      200
Emax          0.00       0.00      0.05      0.05       0.05      200
D             0.53       0.56      0.50      0.06       0.46      200
U             0.00       0.00      0.01     -0.01       0.01      200
Q             0.53       0.56      0.49      0.07       0.46      200
B             0.13       0.13      0.13     -0.01       0.14      200
g             2.43       2.75      2.37      0.37       2.05      200
gp            0.37       0.37      0.35      0.02       0.35      200

cal <- calibrate(f, B=200)
plot(cal, subtitles=FALSE)   # Figure 12.7

n=1046   Mean absolute error=0.009   Mean squared error=0.00012
0.9 Quantile of absolute error=0.017

The  output  of  validate  indicates  minor  overfitting.  Overfitting  would  have 
been worse had the risk factors not been so strong. The closeness of the
calibration curve to the 45° line in Figure 12.7 demonstrates excellent validation
on  an  absolute  probability  scale.  But  the  extent  of  missing  data  casts  some 
doubt  on  the  validity  of  this  model,  and  on  the  efficiency  of  its  parameter 
estimates. 
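The Training, Test, Optimism, and Corrected columns produced by validate follow the optimism bootstrap. The following is a minimal base R sketch of that idea for a single index (Somers' Dxy), using simulated data and glm rather than lrm; the data, the helper function dxy, and the variable names are all assumptions made for illustration, and this is not the rms implementation.

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))   # simulated binary outcome
d  <- data.frame(y, x1, x2)

# Hypothetical helper: Somers' Dxy for a binary outcome, via the c-index
dxy <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  cidx <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  2 * (cidx - 0.5)
}

fit      <- glm(y ~ x1 + x2, family = binomial, data = d)
apparent <- dxy(predict(fit), d$y)                 # "Original Sample" index

B <- 200
optimism <- replicate(B, {
  b     <- d[sample(n, replace = TRUE), ]          # bootstrap resample
  fb    <- glm(y ~ x1 + x2, family = binomial, data = b)
  train <- dxy(predict(fb), b$y)                   # index in the resample
  test  <- dxy(predict(fb, newdata = d), d$y)      # same fit on original data
  train - test
})

corrected <- apparent - mean(optimism)             # overfitting-corrected index
print(c(apparent = apparent, optimism = mean(optimism), corrected = corrected))
```

The model is refitted in every resample, so any step that was tuned on the data would also have to be repeated inside the loop for the correction to be honest.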


Fig.  12.7  Bootstrap  overfitting-corrected  loess  nonparametric  calibration  curve  for 
casewise  deletion  model 
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12.4  Examining  Missing  Data  Patterns 

The  first  step  to  dealing  with  missing  data  is  understanding  the  patterns 
of  missing  values.  To  do  this  we  use  the  Hmisc  library’s  naclus  and  naplot 
functions,  and  the  recursive  partitioning  library  of  Atkinson  and  Therneau. 
Below  naclus  tells  us  which  variables  tend  to  be  missing  on  the  same  persons, 
and  it  computes  the  proportion  of  missing  values  for  each  variable.  The  rpart 
function  derives  a  tree  to  predict  which  types  of  passengers  tended  to  have 
age  missing. 

na.patterns <- naclus(titanic3)
require(rpart)   # Recursive partitioning package
who.na <- rpart(is.na(age) ~ sex + pclass + survived +
                sibsp + parch, data=titanic3, minbucket=15)
naplot(na.patterns, 'na per var')
plot(who.na, margin=.1); text(who.na)   # Figure 12.8
plot(na.patterns)

We see in Figure 12.8 that age tends to be missing on the same passengers
as the body bag identifier, and that it is missing in only 0.09 of first or
second class passengers. The category of passengers having the highest fraction
of missing ages is third class passengers having no parents or children on
board. Below we use Hmisc's summary.formula function to plot simple descriptive
statistics on the fraction of missing ages, stratified by other variables. We
see that without adjusting for other variables, age is slightly more missing on
nonsurviving passengers.

plot(summary(is.na(age) ~ sex + pclass + survived +
             sibsp + parch, data=t3))   # Figure 12.9

Let us derive a logistic model to predict missingness of age, to see if the
survival bias persists after adjustment for the other variables.

m <- lrm(is.na(age) ~ sex * pclass + survived + sibsp + parch,
         data=t3)
print(m, latex=TRUE, needspace='2in')


Logistic  Regression  Model 

lrm(formula = is.na(age) ~ sex * pclass + survived + sibsp +
    parch, data = t3)


                           Model Likelihood     Discrimination    Rank Discrim.
                              Ratio Test           Indexes           Indexes
Obs                 1309   LR χ²      114.99    R²       0.133    C      0.703
 FALSE              1046   d.f.            8    g        1.015    Dxy    0.406
 TRUE                263   Pr(> χ²) < 0.0001    gr       2.759    γ      0.452
max |∂log L/∂β|   5×10⁻⁶                        gp       0.126    τa     0.131
                                                Brier    0.148

[Figure 12.8: panel title "Fraction of NAs in each Variable"; x-axis: Fraction of NAs]

Fig. 12.8 Patterns of missing data. Upper left panel shows the fraction of
observations missing on each predictor. Lower panel depicts a hierarchical cluster
analysis of missingness combinations. The similarity measure shown on the Y-axis is
the fraction of observations for which both variables are missing. Right panel shows
the result of recursive partitioning for predicting is.na(age). The rpart function
found only strong patterns according to passenger class.


                           Coef      S.E.    Wald Z   Pr(>|Z|)
Intercept                -2.2030    0.3641   -6.05    < 0.0001
sex=male                  0.6440    0.3953    1.63    0.1033
pclass=2nd               -1.0079    0.6658   -1.51    0.1300
pclass=3rd                1.6124    0.3596    4.48    < 0.0001
survived                 -0.1806    0.1828   -0.99    0.3232
sibsp                     0.0435    0.0737    0.59    0.5548
parch                    -0.3526    0.1253   -2.81    0.0049
sex=male * pclass=2nd     0.1347    0.7545    0.18    0.8583
sex=male * pclass=3rd    -0.8563    0.4214   -2.03    0.0422

latex(anova(m), file='', label='titanic-anova.na')
# Table 12.3

[Figure 12.9: dot chart of the mean of is.na(age) (x-axis, 0.0 to 1.0) within each stratum, with stratum sizes N]

Stratum                                     N
sex
  female                                  466
  male                                    843
pclass
  1st                                     323
  2nd                                     277
  3rd                                     709
Survived
  No                                      809
  Yes                                     500
Number of Siblings/Spouses Aboard
  0                                       891
  1                                       319
  2                                        42
  3                                        20
  4                                        22
  5                                         6
  8                                         9
Number of Parents/Children Aboard
  0                                      1002
  1                                       170
  2                                       113
  3                                         8
  4                                         6
  5                                         6
  6                                         2
  9                                         2
Overall                                  1309


Fig. 12.9 Univariable descriptions of proportion of passengers with missing age


Fortunately, after controlling for other variables, Table 12.3 provides
evidence that nonsurviving passengers are no more likely to have age missing.
The only important predictors of missingness are pclass and parch (the more
parents or children the passenger has on board, the less likely age was to be
missing).


12.5  Multiple  Imputation 

Multiple imputation is expected to reduce bias in estimates as well as to
provide an estimate of the variance-covariance matrix of β̂ penalized for
imputation. With multiple imputation, survival status can be used to impute
missing ages, so the age relationship will not be as attenuated as with single
conditional mean imputation. The following uses the Hmisc package
aregImpute function to do predictive mean matching, using van Buuren's
“Type 1” matching [85, Section 3.4.2] in conjunction with bootstrapping to
incorporate all uncertainties, in the context of smooth additive imputation
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Table  12.3  Wald  Statistics  for  is.na(age) 


                                                χ²    d.f.  P
sex (Factor+Higher Order Factors)              5.61    3    0.1324
  All Interactions                             5.58    2    0.0614
pclass (Factor+Higher Order Factors)          68.43    4    < 0.0001
  All Interactions                             5.58    2    0.0614
survived                                       0.98    1    0.3232
sibsp                                          0.35    1    0.5548
parch                                          7.92    1    0.0049
sex × pclass (Factor+Higher Order Factors)     5.58    2    0.0614
TOTAL                                         82.90    8    < 0.0001

models. Sampling of donors is handled by distance weighting to yield better
distributions of imputed values. By default, aregImpute does not transform
age when it is being predicted from the other variables. Four knots are used
to transform age when used to impute other variables (not needed here as no
other missings were present in the variables of interest). Since the fraction of
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#  Print  the  first  10  imputations  for  the  first  10  passengers 

# having missing age
mi$imputed$age[1:10, 1:10]

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
16    40   49   24   29 60.0   58   64   36   50    61
38    33   45   40   49 80.0    2   38   38   36    53
41    29   24   19   31 40.0   60   64   42   30    65
47    40   42   29   48 36.0   46   64   30   38    42
60    52   40   22   31 38.0   22   19   24   40    33
70    16   14   23   23 18.0   24   19   27   59    23
71    30   62   57   30 42.0   31   64   40   40    63
75    43   23   36   61 45.5   58   64   27   24    50
81    44   57   47   31 45.0   30   64   62   39    67
107   52   18   24   62 32.5   38   64   47   19    23

plot(mi)
Ecdf(t3$age, add=TRUE, col='gray', lwd=2,
     subtitles=FALSE)   # Fig. 12.10


Fig.  12.10  Distributions  of  imputed  and  actual  ages  for  the  Titanic  dataset.  Imputed 
values  are  in  black  and  actual  ages  in  gray. 


We now fit logistic models for five completed datasets. The fit.mult.impute
function fits five models and examines the within- and between-imputation
variances to compute an imputation-corrected variance-covariance matrix
that is stored in the fit object f.mi. fit.mult.impute will also average the five
β̂ vectors, storing the result in f.mi$coefficients. The function also prints
the ratio of imputation-corrected variances to average ordinary variances.
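The combination of within- and between-imputation variances is conventionally done with Rubin's rule. The following is a minimal sketch for a single coefficient, using made-up estimates and variances for five imputations; treating this as the calculation behind fit.mult.impute's corrected variances is an assumption, and the real function works on the full covariance matrix.

```r
# Hypothetical estimates of one coefficient from M = 5 completed datasets,
# with their ordinary (within-imputation) variances
est  <- c(0.52, 0.48, 0.55, 0.50, 0.45)
wvar <- c(0.010, 0.012, 0.011, 0.009, 0.010)
M    <- length(est)

pooled  <- mean(est)                        # averaged coefficient
within  <- mean(wvar)                       # average ordinary variance
between <- var(est)                         # between-imputation variance
total   <- within + (1 + 1 / M) * between   # Rubin's rule

# variance.ratio is the analogue of the printed ratio of imputation-corrected
# variance to average ordinary variance
print(c(pooled = pooled, total.var = total, variance.ratio = total / within))
```

A ratio near 1 indicates that imputation uncertainty contributes little to that coefficient's variance; larger ratios flag coefficients that depend heavily on the imputed values.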

f.mi <- fit.mult.impute(
  survived ~ (sex + pclass + rcs(age,5))^2 +
    rcs(age,5)*sibsp,
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Table  12.4  Wald  Statistics  for  survived 


                                                      χ²    d.f.  P
sex (Factor+Higher Order Factors)                   240.42    7   < 0.0001
  All Interactions                                   54.56    6   < 0.0001
pclass (Factor+Higher Order Factors)                114.21   12   < 0.0001
  All Interactions                                   36.43   10   0.0001
age (Factor+Higher Order Factors)                    50.37   20   0.0002
  All Interactions                                   25.88   16   0.0557
  Nonlinear (Factor+Higher Order Factors)            24.21   15   0.0616
sibsp (Factor+Higher Order Factors)                  24.22    5   0.0002
  All Interactions                                   12.86    4   0.0120
sex × pclass (Factor+Higher Order Factors)           30.99    2   < 0.0001
sex × age (Factor+Higher Order Factors)              11.38    4   0.0226
  Nonlinear                                           8.15    3   0.0430
  Nonlinear Interaction : f(A,B) vs. AB               8.15    3   0.0430
pclass × age (Factor+Higher Order Factors)            5.30    8   0.7246
  Nonlinear                                           4.63    6   0.5918
  Nonlinear Interaction : f(A,B) vs. AB               4.63    6   0.5918
age × sibsp (Factor+Higher Order Factors)            12.86    4   0.0120
  Nonlinear                                           1.84    3   0.6058
  Nonlinear Interaction : f(A,B) vs. AB               1.84    3   0.6058
TOTAL NONLINEAR                                      24.21   15   0.0616
TOTAL INTERACTION                                    67.12   18   < 0.0001
TOTAL NONLINEAR + INTERACTION                        70.99   21   < 0.0001
TOTAL                                               298.78   26   < 0.0001

  lrm, mi, data=t3, pr=FALSE)
latex(anova(f.mi), file='', label='titanic-anova.mi',
      size='small')   # Table 12.4

The Wald χ² for age is reduced by accounting for imputation but is
increased (by a lesser amount) by using patterns of association with survival
status to impute missing age. The Wald tests are all adjusted for multiple
imputation. Now examine the fitted age relationship using multiple imputation
vs. casewise deletion.


p1 <- Predict(f,    age, pclass, sex, sibsp=0, fun=plogis)
p2 <- Predict(f.mi, age, pclass, sex, sibsp=0, fun=plogis)
p  <- rbind('Casewise Deletion'=p1, 'Multiple Imputation'=p2)
ggplot(p, groups='sex', ylab='Probability of Surviving')
# Figure 12.11

12.6  Summarizing  the  Fitted  Model 

In this section we depict the model fitted using multiple imputation, by
computing odds ratios and by showing various predicted values. For age, the odds
ratio for an increase from 1 year old to 30 years old is computed, instead of
the default odds ratio based on outer quartiles of age. The estimated odds
ratios are very dependent on the levels of interacting factors, so Figure 12.12
depicts only one of many patterns.

[Figure 12.11: x-axis: Age, years; legend: sex (female, male)]

Fig. 12.11 Predicted probability of survival for males from fit using casewise deletion
again (top) and multiple random draw imputation (bottom). Both sets of predictions
are for sibsp=0.


# Get predicted values for certain types of passengers
s <- summary(f.mi, age=c(1,30), sibsp=0:1)
# override default ranges for 3 variables
plot(s, log=TRUE, main='')   # Figure 12.12

Now  compute  estimated  probabilities  of  survival  for  a  variety  of  settings  of 
the  predictors. 


[Figure 12.12: odds ratios on a log scale (axis ticks 0.10, 0.50, 2.00, 5.00) for the contrasts age 30:1, sibsp 1:0, sex female:male, pclass 1st:3rd, and pclass 2nd:3rd. Adjusted to: sex=male pclass=3rd age=28 sibsp=0]

Fig.  12.12  Odds  ratios  for  some  predictor  settings 


phat <- predict(f.mi,
                combos <-
                  expand.grid(age=c(2,21,50), sex=levels(t3$sex),
                              pclass=levels(t3$pclass),
                              sibsp=0), type='fitted')
# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,
#                      sibsp=0, fun=plogis)$yhat
options(digits=1)
data.frame(combos, phat)

   age    sex pclass sibsp phat
1    2 female    1st     0 0.97
2   21 female    1st     0 0.98
3   50 female    1st     0 0.97
4    2   male    1st     0 0.88
5   21   male    1st     0 0.48
6   50   male    1st     0 0.27
7    2 female    2nd     0 1.00
8   21 female    2nd     0 0.90
9   50 female    2nd     0 0.82
10   2   male    2nd     0 1.00
11  21   male    2nd     0 0.08
12  50   male    2nd     0 0.04
13   2 female    3rd     0 0.85
14  21 female    3rd     0 0.57
15  50 female    3rd     0 0.37
16   2   male    3rd     0 0.91
17  21   male    3rd     0 0.13
18  50   male    3rd     0 0.06

options(digits=5)

We  can  also  get  predicted  values  by  creating  an  R  function  that  will  evaluate 
the model on demand.

pred.logit <- Function(f.mi)
# Note: if don't define sibsp to pred.logit, defaults to 0
# normally just type the function name to see its body
latex(pred.logit, file='', type='Sinput', size='small',
      width.cutoff=49)


pred.logit <- function(sex = "male", pclass = "3rd",
    age = 28, sibsp = 0)
{
    3.2427671 - 0.95431809 * (sex == "male") + 5.4086505 *
        (pclass == "2nd") - 1.3378623 * (pclass ==
        "3rd") + 0.091162649 * age - 0.00031204327 *
        pmax(age - 6, 0)^3 + 0.0021750413 * pmax(age -
        21, 0)^3 - 0.0027627032 * pmax(age - 27, 0)^3 +
        0.0009805137 * pmax(age - 36, 0)^3 - 8.0808484e-05 *
        pmax(age - 55.8, 0)^3 - 1.1567976 * sibsp +
        (sex == "male") * (-0.46061284 * (pclass ==
        "2nd") + 2.0406523 * (pclass == "3rd")) +
        (sex == "male") * (-0.22469066 * age + 0.00043708296 *
        pmax(age - 6, 0)^3 - 0.0026505136 * pmax(age -
        21, 0)^3 + 0.0031201404 * pmax(age - 27,
        0)^3 - 0.00097923749 * pmax(age - 36,
        0)^3 + 7.2527708e-05 * pmax(age - 55.8,
        0)^3) + (pclass == "2nd") * (-0.46144083 *
        age + 0.00070194849 * pmax(age - 6, 0)^3 -
        0.0034726662 * pmax(age - 21, 0)^3 + 0.0035255387 *
        pmax(age - 27, 0)^3 - 0.0007900891 * pmax(age -
        36, 0)^3 + 3.5268151e-05 * pmax(age - 55.8,
        0)^3) + (pclass == "3rd") * (-0.17513289 *
        age + 0.00035283358 * pmax(age - 6, 0)^3 -
        0.0023049372 * pmax(age - 21, 0)^3 + 0.0028978962 *
        pmax(age - 27, 0)^3 - 0.00105145 * pmax(age -
        36, 0)^3 + 0.00010565735 * pmax(age - 55.8,
        0)^3) + sibsp * (0.040830773 * age - 1.5627772e-05 *
        pmax(age - 6, 0)^3 + 0.00012790256 * pmax(age -
        21, 0)^3 - 0.00025039385 * pmax(age - 27,
        0)^3 + 0.00017871701 * pmax(age - 36, 0)^3 -
        4.0597949e-05 * pmax(age - 55.8, 0)^3)
}


#  Run  the  newly  created  function 

plogis(pred.logit(age=c(2,21,50), sex='male', pclass='3rd'))


[1]  0.914817  0.132640  0.056248 

A  nomogram  could  be  used  to  obtain  predicted  values  manually,  but  this  is 
not  feasible  when  so  many  interaction  terms  are  present. 


Chapter  13 

Ordinal  Logistic  Regression 


13.1  Background 


Many  medical  and  epidemiologic  studies  incorporate  an  ordinal  response 
variable.  In  some  cases  an  ordinal  response  Y  represents  levels  of  a  standard 
measurement  scale  such  as  severity  of  pain  (none,  mild,  moderate,  severe). 
In  other  cases,  ordinal  responses  are  constructed  by  specifying  a  hierarchy 
of  separate  endpoints.  For  example,  clinicians  may  specify  an  ordering  of 
the  severity  of  several  component  events  and  assign  patients  to  the  worst 
event  present  from  among  none,  heart  attack,  disabling  stroke,  and  death. 
Still  another  use  of  ordinal  response  methods  is  the  application  of  rank-based 
methods to continuous responses so as to obtain robust inferences. For
example, the proportional odds model described later allows for a continuous
Y and is really a generalization of the Wilcoxon-Mann-Whitney rank test.
Thus  the  semiparametric  proportional  odds  model  is  a  direct  competitor  of 
ordinary  linear  models. 

There  are  many  variations  of  logistic  models  used  for  predicting  an  ordinal 
response  variable  Y.  All  of  them  have  the  advantage  that  they  do  not  assume 
a  spacing  between  levels  of  Y.  In  other  words,  the  same  regression  coefficients 
and P-values result from an analysis of a response variable having levels 0, 1, 2
when the levels are recoded 0, 1, 20. Thus ordinal models use only the
rank-ordering of values of Y.
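The invariance to recoding can be checked with the Wilcoxon test, which the previous paragraph identifies as a special case of the proportional odds model. This is a minimal sketch on simulated data (group labels and responses are hypothetical); it substitutes base R's wilcox.test for fitting an ordinal model, so it illustrates the rank-only property rather than the model itself.

```r
set.seed(3)
g  <- factor(rep(c("A", "B"), each = 50))   # hypothetical two-group predictor
y  <- sample(0:2, 100, replace = TRUE)      # ordinal response coded 0, 1, 2
y2 <- ifelse(y == 2, 20, y)                 # same ordering, recoded 0, 1, 20

# exact = FALSE requests the normal approximation, appropriate with ties
w1 <- wilcox.test(y  ~ g, exact = FALSE)
w2 <- wilcox.test(y2 ~ g, exact = FALSE)

# The ranks are unchanged by the recoding, so statistics and P-values agree
print(c(W.original = unname(w1$statistic), W.recoded = unname(w2$statistic)))
print(all.equal(w1$p.value, w2$p.value))
```

A linear model for the same two codings would give different coefficients and P-values, because it uses the assumed spacing between levels.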

In  this  chapter  we  consider  two  of  the  most  popular  ordinal  logistic  models, 
the proportional odds (PO) form of an ordinal logistic model and the
forward continuation ratio (CR) ordinal logistic model [190]. Chapter 15 deals with
a wider variety of ordinal models with emphasis on analysis of continuous Y.
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13.2  Ordinality  Assumption 

A  basic  assumption  of  all  commonly  used  ordinal  regression  models  is  that  the 
response  variable  behaves  in  an  ordinal  fashion  with  respect  to  each  predictor. 
Assuming that a predictor X is linearly related to the log odds of some
appropriate event, a simple way to check for ordinality is to plot the mean
of X stratified by levels of Y. These means should be in a consistent order.
If for many of the Xs, two adjacent categories of Y do not distinguish the
means,  that  is  evidence  that  those  levels  of  Y  should  be  pooled. 

One can also estimate the mean or expected value of $X \mid Y = j$, $E(X \mid Y = j)$,
given that the ordinal model assumptions hold. This is a useful tool for
checking those assumptions, at least in an unadjusted fashion. For simplicity,
assume that X is discrete, and let $P_{jx} = \Pr(Y = j \mid X = x)$ be the probability
that Y = j given X = x that is dictated from the model being fitted, with
X being the only predictor in the model. Then

$$\Pr(X = x \mid Y = j) = \Pr(Y = j \mid X = x) \Pr(X = x) / \Pr(Y = j)$$

$$E(X \mid Y = j) = \sum_{x} x P_{jx} \Pr(X = x) / \Pr(Y = j),$$  (13.1)

and the expectation can be estimated by

$$\hat{E}(X \mid Y = j) = \sum_{x} x \hat{P}_{jx} f_x / g_j,$$  (13.2)

where $\hat{P}_{jx}$ denotes the estimate of $P_{jx}$ from the fitted one-predictor model
(for inner values of Y in the PO models, these probabilities are differences
between terms given by Equation 13.4 below), $f_x$ is the frequency of X = x
in the sample of size n, and $g_j$ is the frequency of Y = j in the sample. This
estimate can be computed conveniently without grouping the data by X. For
n subjects let the n values of X be $x_1, x_2, \ldots, x_n$. Then

$$\hat{E}(X \mid Y = j) = \sum_{i=1}^{n} x_i \hat{P}_{j x_i} / g_j.$$  (13.3)
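The subject-level form in Equation 13.3 can be sketched directly from fitted cell probabilities. The example below uses simulated data and `polr` from MASS as a stand-in PO fitter (the text's own tool for this is an rms function described in Section 13.3.9); all variable names are illustrative assumptions.

```r
library(MASS)  # polr() used here as a stand-in PO fitter

set.seed(2)
n <- 500
x <- rnorm(n)
# Simulated four-level response coded 0..3 (illustrative data only)
y <- cut(x + rlogis(n), breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE) - 1
yf <- factor(y)

fit <- polr(yf ~ x)
P_hat <- predict(fit, type = "probs")  # n x 4 matrix: Pr(Y = j | X = x_i)
g <- as.vector(table(yf))              # g_j: frequency of Y = j

# Equation 13.3: sum over subjects of x_i * P_hat[i, j], divided by g_j
expected_po <- colSums(x * P_hat) / g
# Simple Y-stratified means of X for comparison
observed <- as.vector(tapply(x, yf, mean))

round(cbind(j = 0:3, observed, expected_po), 3)
```

Close agreement between the two columns, and a consistent ordering across j, is what one hopes to see when the ordinality and PO assumptions are reasonable for this predictor.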

Note that if one were to compute differences between conditional means of X
and the conditional means of X given PO, and if furthermore the means were
conditioned on $Y \geq j$ instead of Y = j, the result would be proportional to
means of score residuals defined later in Equation 13.6.

13.3 Proportional Odds Model
13.3.1  Model 


The most commonly used ordinal logistic model was described in Walker
and Duncan [647] and later called the proportional odds (PO) model by
McCullagh [449]. The PO model is best stated as follows, for a response variable
having levels 0, 1, 2, ..., k:

$$\Pr[Y \geq j \mid X] = \frac{1}{1 + \exp[-(\alpha_j + X\beta)]},$$  (13.4)

where j = 1, 2, ..., k. Some authors write the model in terms of $Y \leq j$. Our
formulation makes the model coefficients consistent with the binary logistic
model. There are k intercepts ($\alpha$s). For fixed j, the model is an ordinary
logistic model for the event $Y \geq j$. By using a common vector of regression
coefficients $\beta$ connecting probabilities for varying j, the PO model allows for
parsimonious modeling of the distribution of Y.
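To make the link between the k cumulative probabilities and the k + 1 cell probabilities concrete, here is a small sketch. The helper name `po_cell_probs` and all parameter values are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: cell probabilities implied by Equation 13.4
po_cell_probs <- function(alpha, beta, x) {
  stopifnot(all(diff(alpha) < 0))          # intercepts must decrease with j
  exceed <- plogis(alpha + sum(x * beta))  # Pr(Y >= j | X), j = 1, ..., k
  exceed <- c(1, exceed, 0)                # Pr(Y >= 0) = 1, Pr(Y >= k + 1) = 0
  probs <- -diff(exceed)                   # Pr(Y = j) = Pr(Y >= j) - Pr(Y >= j + 1)
  names(probs) <- 0:length(alpha)
  probs
}

alpha <- c(1.5, 0, -2)   # k = 3 intercepts (made-up values)
beta  <- c(0.8, -0.5)    # common slopes for two predictors (made-up values)
p <- po_cell_probs(alpha, beta, x = c(1, 2))
p
```

The decreasing intercepts guarantee that every cell probability is positive, and the probabilities sum to one by construction.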

There is a nice connection between the PO model and the Wilcoxon-Mann-Whitney
two-sample test: when there is a single predictor $X_1$ that is
binary, the numerator of the score test for testing $H_0: \beta_1 = 0$ is proportional
to the two-sample test statistic [664, pp. 2258-2259].


13.3.2  Assumptions  and  Interpretation  of  Parameters 

There is an implicit assumption in the PO model that the regression coefficients
($\beta$) are independent of j, the cutoff level for Y. One could say that
there is no X × Y interaction if PO holds. For a specific Y-cutoff j, the model
has the same assumptions as the binary logistic model (Section 10.1.1). That
is, the model in its simplest form assumes the log odds that $Y \geq j$ is linearly
related to each X and that there is no interaction between the Xs.

In  designing  clinical  studies,  one  sometimes  hears  the  statement  that  an 
ordinal  outcome  should  be  avoided  since  statistical  tests  of  patterns  of  those 
outcomes  are  hard  to  interpret.  In  fact,  one  interprets  effects  in  the  PO  model 
using ordinary odds ratios. The difference is that a single odds ratio is assumed
to apply equally to all events $Y \geq j$, j = 1, 2, ..., k. If linearity and
additivity hold, the $X_m + 1 : X_m$ odds ratio for $Y \geq j$ is $\exp(\beta_m)$, whatever
the cutoff j.
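The constancy of the odds ratio across cutoffs follows directly from Equation 13.4 and can be verified numerically. The intercepts and slope below are made-up values used only for illustration.

```r
alpha  <- c(2, 0.5, -1)  # made-up intercepts for cutoffs j = 1, 2, 3
beta_m <- 0.7            # made-up slope for predictor X_m
odds <- function(p) p / (1 - p)

x_m <- 1.3               # any starting value of X_m
# Odds of Y >= j at X_m + 1 divided by odds at X_m, one ratio per cutoff
or_by_cutoff <- odds(plogis(alpha + beta_m * (x_m + 1))) /
                odds(plogis(alpha + beta_m * x_m))

or_by_cutoff   # identical at every cutoff
exp(beta_m)    # and equal to exp(beta_m)
```

The intercepts cancel in the ratio, so the result does not depend on j or on the starting value of $X_m$.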

The proportional odds assumption is frequently violated, just as the
assumptions of normality of residuals with equal variance in ordinary regression
are frequently violated, but the PO model can still be useful and powerful in
this situation. As stated by Senn and Julious [564],

Clearly, the dependence of the proportional odds model on the assumption
of  proportionality  can  be  over-stressed.  Suppose  that  two  different  statisticians 
would  cut  the  same  three-point  scale  at  different  cut  points.  It  is  hard  to  see  how 
anybody  who  could  accept  either  dichotomy  could  object  to  the  compromise 
answer  produced  by  the  proportional  odds  model. 

Sometimes  it  helps  in  interpreting  the  model  to  estimate  the  mean  Y  as 
a  function  of  one  or  more  predictors,  even  though  this  assumes  a  spacing  for 
the  Y-levels. 


13.3.3  Estimation 

The  PO  model  is  fitted  using  MLE  on  a  somewhat  complex  likelihood  function 
that  is  dependent  on  differences  in  logistic  model  probabilities.  The  estimation 
process forces the $\alpha$s to be in descending order.


13.3.4  Residuals 


Schoenfeld residuals [557] are very effective [23] in checking the proportional
hazards assumption in the Cox [132] survival model. For the PO model one could
analogously compute each subject's contribution to the first derivative of
the log likelihood function with respect to $\beta_m$, average them separately by
levels of Y, and examine trends in the residual plots as in Section 20.6.2.
A few examples have shown that such plots are usually hard to interpret.
Easily interpreted score residual plots for the PO model can be constructed,
however, by using the fitted PO model to predict a series of binary events
$Y \geq j$, j = 1, 2, ..., k, using the corresponding predicted probabilities

$$\hat{P}_{ij} = \frac{1}{1 + \exp[-(\hat{\alpha}_j + X_i \hat{\beta})]},$$  (13.5)

where $X_i$ stands for a vector of predictors for subject i. Then, after forming
an indicator variable for the event currently being predicted ($[Y_i \geq j]$), one
computes the score (first derivative) components $U_{im}$ from an ordinary binary
logistic model:

$$U_{im} = X_{im}([Y_i \geq j] - \hat{P}_{ij}),$$  (13.6)

for the subject i and predictor m. Then, for each column of U, plot the mean
$\bar{U}_{\cdot m}$ and confidence limits, with Y (i.e., j) on the x-axis. For each predictor
the trend against j should be flat if PO holds.^a In binary logistic regression,
partial residuals are very useful as they allow the analyst to fit linear effects
for all the predictors but then to nonparametrically estimate the true
transformation that each predictor requires (Section 10.4). The partial residual is
defined as follows, for the ith subject and mth predictor variable [115, 373].

$$r_{im} = \hat{\beta}_m X_{im} + \frac{Y_i - \hat{P}_i}{\hat{P}_i (1 - \hat{P}_i)},$$  (13.7)

where

$$\hat{P}_i = \frac{1}{1 + \exp[-(\hat{\alpha} + X_i \hat{\beta})]}.$$  (13.8)

^a If $\hat{\alpha}$ were derived from separate binary fits, all $\bar{U}_{\cdot m} = 0$.


A smoothed plot (e.g., using the moving linear regression algorithm in
loess [111]) of $X_{im}$ against $r_{im}$ provides a nonparametric estimate of how $X_m$
relates to the log relative odds that $Y = 1 \mid X_m$. For ordinal Y, we just need
to compute binary model partial residuals for all cutoffs j:

$$r_{im} = \hat{\beta}_m X_{im} + \frac{[Y_i \geq j] - \hat{P}_{ij}}{\hat{P}_{ij} (1 - \hat{P}_{ij})},$$  (13.9)

then to make a plot for each m showing smoothed partial residual curves for
all j, looking for similar shapes and slopes for a given predictor for all j. Each
curve provides an estimate of how $X_m$ relates to the relative log odds that
$Y \geq j$. Since partial residuals allow examination of predictor transformations
(linearity) while simultaneously allowing examination of PO (parallelism),
partial residual plots are generally preferred over score residual plots for
ordinal models.
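The score residuals of Equation 13.6 and the cutoff-specific partial residuals can be computed by hand from any PO fit. The sketch below uses simulated data and `polr` from MASS as a stand-in fitter; because `polr` parameterizes the model in terms of $Y \leq j$, its thresholds are negated to obtain the $\alpha_j$ used in this chapter. Variable names and the choice of `lowess` with `iter = 0` are illustrative assumptions.

```r
library(MASS)

set.seed(3)
n <- 400
x <- rnorm(n)
y <- cut(x + rlogis(n), breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE) - 1
fit <- polr(factor(y) ~ x)

beta_hat  <- unname(coef(fit))
# polr models logit Pr(Y <= j) = zeta_j - x * beta, so alpha_j = -zeta_j
alpha_hat <- -unname(fit$zeta)
k <- length(alpha_hat)

res <- lapply(1:k, function(j) {
  p_ij  <- plogis(alpha_hat[j] + beta_hat * x)   # predicted Pr(Y >= j), Eq. 13.5
  event <- as.numeric(y >= j)                    # indicator [Y_i >= j]
  list(score   = x * (event - p_ij),             # score residuals, Eq. 13.6
       partial = beta_hat * x + (event - p_ij) / (p_ij * (1 - p_ij)))
})

# Mean score residual per cutoff; a flat trend in j is consistent with PO
mean_score <- sapply(res, function(r) mean(r$score))

# Smoothed partial residual curves, one per cutoff; iter = 0 turns off
# outlier down-weighting, a choice made here because these residuals are skewed
smooth <- lapply(res, function(r) lowess(x, r$partial, iter = 0))
```

Each element of `smooth` can be drawn with `lines()` on a common plot; roughly parallel, roughly linear curves would support both PO and linearity for this predictor.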

Li  and  Shepherd40"  have  a  residual  for  ordinal  models  that  serves  for  the 
entire range of Y without the need to consider cutoffs. Their residual is useful
for checking functional form of predictors but not the proportional odds
assumption. 


13.3.5  Assessment  of  Model  Fit 


Peterson and Harrell [502] developed score and likelihood ratio tests for testing
the PO assumption. The score test is used in the SAS PROC LOGISTIC [540],
but its extreme anti-conservatism in many cases can make it unreliable [502].

For determining whether the PO assumption is likely to be satisfied for
each predictor separately, there are several graphics that are useful. One is the
graph comparing means of $X \mid Y$ with and without assuming PO, as described
in Section 13.2 (see Figure 14.2 for an example). Another is the simple method
of stratifying on each predictor and computing the logits of all proportions of
the form $Y \geq j$, j = 1, 2, ..., k. When proportional odds holds, the differences
in logits between different values of j should be the same at all levels of X,
because the model dictates that $\operatorname{logit}(Y \geq j \mid X) - \operatorname{logit}(Y \geq i \mid X) = \alpha_j - \alpha_i$,
for any constant X. An example of this is in Figure 13.1.

require(Hmisc)
getHdata(support)
# Recode the ordinal response to integer levels 0, 1, 2, ...
sfdm <- as.integer(support$sfdm2) - 1
# Logits of the proportions Y >= 1, Y >= 2, Y >= 3 within a stratum
sf <- function(y)
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)))
s <- summary(sfdm ~ adlsc + sex + age + meanbp, fun=sf,
             data=support)
plot(s, which=1:3, pch=1:3, xlab='logit', vnames='names',
     main='', width.factor=1.5)    # Figure 13.1


Fig.  13.1  Checking  PO  assumption  separately  for  a  series  of  predictors.  The  circle, 
triangle, and plus sign correspond to $Y \geq 1, 2, 3$, respectively. PO is checked by
examining the vertical constancy of distances between any two of these three symbols.
Response variable is the severe functional disability scale sfdm2 from the 1000-patient
SUPPORT  dataset,  with  the  last  two  categories  combined  because  of  low  frequency 
of  coma/intubation. 


When Y is continuous or almost continuous and X is discrete, the PO model
assumes that the logit of the cumulative distribution function of Y is parallel
across categories of X. The corresponding, more rigid, assumptions of the
ordinary linear model (here, parametric ANOVA) are parallelism and linearity
of the normal inverse cumulative distribution function across categories
of X. As an example consider the web site's diabetes dataset, where we consider
the distribution of log glycohemoglobin across subjects' body frames.

getHdata(diabetes)
# Left panel: normal inverse of the ECDF (checks parametric ANOVA assumptions)
a <- Ecdf(~ log(glyhb), group=frame, fun=qnorm,
          xlab='log(HbA1c)', label.curves=FALSE, data=diabetes,
          ylab=expression(paste(Phi^-1, (F[n](x)))))   # Fig. 13.2
# Right panel: logit of the ECDF (checks PO model assumptions)
b <- Ecdf(~ log(glyhb), group=frame, fun=qlogis,
          xlab='log(HbA1c)', label.curves=list(keys='lines'),
          data=diabetes, ylab=expression(logit(F[n](x))))
print(a, more=TRUE, split=c(1,1,2,1))
print(b, split=c(2,1,2,1))



Fig.  13.2  Transformed  empirical  cumulative  distribution  functions  stratified  by  body 
frame  in  the  diabetes  dataset.  Left  panel:  checking  all  assumptions  of  the  parametric 
ANOVA.  Right  panel:  checking  all  assumptions  of  the  PO  model  (here,  Kruskal-Wallis 
test). 


One  could  conclude  the  right  panel  of  Figure  13.2  displays  more  parallelism 
than  the  left  panel  displays  linearity,  so  the  assumptions  of  the  PO  model  are 
better  satisfied  than  the  assumptions  of  the  ordinary  linear  model. 

Chapter  14  has  many  examples  of  graphics  for  assessing  fit  of  PO  models. 
Regarding  assessment  of  linearity  and  additivity  assumptions,  splines,  partial 
residual  plots,  and  interaction  tests  are  among  the  best  tools.  Fagerland  and 
Hosmer18 have a good review of goodness-of-fit tests for the PO model.

13.3.6 Quantifying Predictive Ability

The $R^2_N$ coefficient is really computed from the model LR $\chi^2$ ($\chi^2$ added to
a model containing only the k intercept parameters) to describe the model's
predictive power. The Somers' $D_{xy}$ rank correlation between $X\hat{\beta}$ and Y is
an easily interpreted measure of predictive discrimination. Since it is a rank
measure, it does not matter which intercept $\alpha$ is used in the calculation.
The probability of concordance, c, is also a useful measure. Here one takes all
possible pairs of subjects having differing Y values and computes the fraction
of such pairs for which the values of $X\hat{\beta}$ are in the same direction as the two
Y values. c could be called a generalized ROC area in this setting. As before,
$D_{xy} = 2(c - 0.5)$. Note that $D_{xy}$, c, and the Brier score B can easily be
computed for various dichotomizations of Y, to investigate predictive ability
in more detail.
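The pairwise definition of c translates into a few lines of code. This brute-force sketch is meant for small samples only; the function name `concordance` is hypothetical, and counting pairs tied on the linear predictor as one half is an assumption of this sketch rather than something stated in the text.

```r
# Hypothetical helper: probability of concordance c and Somers' Dxy
concordance <- function(lp, y) {
  dy <- outer(y, y, "-")     # dy[i, j] = y[i] - y[j]
  dl <- outer(lp, lp, "-")   # dl[i, j] = lp[i] - lp[j]
  use <- dy > 0              # each pair with differing Y counted once
  conc <- sum(dl[use] > 0)   # linear predictor ordered the same way as Y
  ties <- sum(dl[use] == 0)  # ties on the linear predictor count one half
  c_index <- (conc + 0.5 * ties) / sum(use)
  c(c = c_index, Dxy = 2 * (c_index - 0.5))
}

concordance(lp = c(-1.2, 0.3, 0.1, 2.0), y = c(0, 1, 1, 2))
```

Only the ordering of the linear predictor matters, which is why the choice of intercept is irrelevant here.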


13.3.7  Describing  the  Fitted  Model 

As  discussed  in  Section  5.1,  models  are  best  described  by  computing  predicted 
values  or  differences  in  predicted  values.  For  PO  models  there  are  four  and 
sometimes  five  types  of  relevant  predictions: 

1. $\operatorname{logit}[Y \geq j \mid X]$, i.e., the linear predictor

2. $\operatorname{Prob}[Y \geq j \mid X]$

3. $\operatorname{Prob}[Y = j \mid X]$

4. Quantiles of $Y \mid X$ (e.g., the median^b)

5. $E(Y \mid X)$ if Y is interval scaled.

For  the  first  two  quantities  above  a  good  default  choice  for  j  is  the  middle 
category.  Partial  effect  plots  are  as  simple  to  draw  for  PO  models  as  they  are 
for  binary  logistic  models.  Other  useful  graphics,  as  before,  are  odds  ratio 
charts  and  nomograms.  For  the  latter,  an  axis  displaying  the  predicted  mean 
makes  the  model  more  interpretable,  under  scaling  assumptions  on  Y . 
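Once the cell probabilities at a covariate setting are available, the remaining quantities in the list above follow by simple arithmetic. The probabilities below are made-up values for a four-level response, used only to illustrate the calculations.

```r
# Made-up cell probabilities Pr(Y = j | X) for Y = 0, 1, 2, 3 at one covariate setting
p <- c(`0` = 0.15, `1` = 0.30, `2` = 0.35, `3` = 0.20)
levels_y <- as.numeric(names(p))

# Exceedance probabilities Pr(Y >= j | X), j = 0, ..., 3
exceed <- rev(cumsum(rev(p)))

# Predicted mean; meaningful only if Y is interval scaled
mean_y <- sum(levels_y * p)

# Median: smallest level j with Pr(Y <= j | X) >= 0.5
median_y <- levels_y[which(cumsum(p) >= 0.5)[1]]

c(mean = mean_y, median = median_y)
```

With integer codes 0, ..., k the mean also equals the sum of the exceedance probabilities for j = 1, ..., k, which gives a quick consistency check.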


13.3.8  Validating  the  Fitted  Model 

The  PO  model  is  validated  much  the  same  way  as  the  binary  logistic  model 
(see  Section  10.9).  For  estimating  an  overfitting-corrected  calibration  curve 
(Section 10.11) one estimates $\Pr(Y \geq j \mid X)$ using one j at a time.

^b If Y does not have very many levels, the median will be a discontinuous function
of X and may not be satisfactory.

13.3.9 R Functions

The rms package's lrm and orm functions fit the PO model directly, assuming
that the levels of the response variable (e.g., the levels of a factor variable)
are listed in the proper order. lrm is intended to be used for the case where the
number of unique values of Y is less than a few dozen, whereas orm handles
the continuous Y case efficiently, as well as allowing for links other than the
logit. See Chapter 15 for more information.

If  the  response  is  numeric,  lrm  assumes  the  numeric  codes  properly  order 
the  responses.  If  it  is  a  character  vector  and  is  not  a  factor,  lrm  assumes  the 
correct  ordering  is  alphabetic.  Of  course  ordered  variables  in  R  are  appropriate 
response variables for ordinal regression. The predict function (predict.lrm)
can  compute  all  the  quantities  listed  in  Section  13.3.7  except  for  quantiles. 

The  R  functions  popower  and  posamsize  (in  the  Hmisc  package)  compute 
power  and  sample  size  estimates  for  ordinal  responses  using  the  proportional 
odds  model. 

The function plot.xmean.ordinaly in rms computes and graphs the quantities
described in Section 13.2. It plots simple Y-stratified means overlaid with
$\hat{E}(X \mid Y = j)$, with j on the x-axis. The $\hat{E}$s are computed for both PO and
continuation ratio ordinal logistic models. The Hmisc package's summary.formula
function is also useful for assessing the PO assumption (Figure 13.1). Generic
rms functions such as validate, calibrate, and nomogram work with PO model
fits from lrm as long as the analyst specifies which intercept(s) to use. rms has
a  special  function  generator  Mean  for  constructing  an  easy-to-use  function  for 
getting  the  predicted  mean  Y  from  a  PO  model.  This  is  handy  with  plot  and 
nomogram.  If  the  fit  has  been  run  through  the  bootcov  function,  it  is  easy  to 
use  the  Predict  function  to  estimate  bootstrap  confidence  limits  for  predicted 
means. 


13.4  Continuation  Ratio  Model 


13.4.1 Model

Unlike the PO model, which is based on cumulative probabilities, the continuation
ratio (CR) model is based on conditional probabilities. The (forward)
CR model [31, 52, 190] is stated as follows for Y = 0, ..., k.

$$\begin{aligned}
\Pr(Y = j \mid Y \geq j, X) &= \frac{1}{1 + \exp[-(\theta_j + X\gamma)]} \\
\operatorname{logit}(Y = 0 \mid Y \geq 0, X) &= \operatorname{logit}(Y = 0 \mid X) = \theta_0 + X\gamma \\
\operatorname{logit}(Y = 1 \mid Y \geq 1, X) &= \theta_1 + X\gamma \\
&\;\;\vdots \\
\operatorname{logit}(Y = k - 1 \mid Y \geq k - 1, X) &= \theta_{k-1} + X\gamma.
\end{aligned}$$  (13.10)

The  CR  model  has  been  said  to  be  likely  to  fit  ordinal  responses  when  subjects 
have  to  “pass  through”  one  category  to  get  to  the  next.  The  CR  model  is  a 
discrete  version  of  the  Cox  proportional  hazards  model.  The  discrete  hazard 
function is defined as $\Pr(Y = j \mid Y \geq j)$.


13.4.2 Assumptions and Interpretation of Parameters

The CR model assumes that the vector of regression coefficients, $\gamma$, is the
same regardless of which conditional probability is being computed.

One could say that there is no X × condition interaction if the CR model
holds. For a specific condition $Y \geq j$, the model has the same assumptions as
the binary logistic model (Section 10.1.1). That is, the model in its simplest
form assumes that the log odds that Y = j conditional on $Y \geq j$ is linearly
related to each X and that there is no interaction between the Xs.

A single odds ratio is assumed to apply equally to all conditions $Y \geq j$,
j = 0, 1, 2, ..., k - 1. If linearity and additivity hold, the $X_m + 1 : X_m$ odds ratio
for Y = j is $\exp(\gamma_m)$, whatever the conditioning event $Y \geq j$.

To compute $\Pr(Y > 0 \mid X)$ from the CR model, one only needs to take
one minus $\Pr(Y = 0 \mid X)$. To compute other unconditional probabilities from
the CR model, one must multiply the conditional probabilities. For example,

$$\begin{aligned}
\Pr(Y > 1 \mid X) &= \Pr(Y > 1 \mid X, Y \geq 1) \times \Pr(Y \geq 1 \mid X) \\
&= [1 - \Pr(Y = 1 \mid Y \geq 1, X)] \, [1 - \Pr(Y = 0 \mid X)] \\
&= \left[1 - \frac{1}{1 + \exp[-(\theta_1 + X\gamma)]}\right] \left[1 - \frac{1}{1 + \exp[-(\theta_0 + X\gamma)]}\right].
\end{aligned}$$
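The multiplication of conditional probabilities generalizes to all levels of Y. A minimal sketch follows; the helper name `cr_cell_probs` and the parameter values are hypothetical.

```r
# Hypothetical helper: unconditional cell probabilities from a forward CR model
cr_cell_probs <- function(theta, gamma, x) {
  cond  <- plogis(theta + sum(x * gamma))  # Pr(Y = j | Y >= j, X), j = 0, ..., k - 1
  reach <- cumprod(c(1, 1 - cond))         # Pr(Y >= j | X), j = 0, ..., k
  probs <- c(cond, 1) * reach              # Pr(Y = j | X), j = 0, ..., k
  names(probs) <- 0:length(theta)
  probs
}

theta <- c(-1, -0.5, 0.2)   # made-up intercepts theta_0, theta_1, theta_2 (k = 3)
gamma <- 0.6                # made-up slope
p <- cr_cell_probs(theta, gamma, x = 1)
p

# Pr(Y > 1 | X) as in the worked example: product of two complements
cond <- plogis(theta + gamma * 1)
(1 - cond[1]) * (1 - cond[2])
```

Here `reach` holds the probabilities of surviving to each cohort, so each cell probability is the conditional "failure" probability times the probability of reaching that cohort.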


13.4.3 Estimation

Armstrong and Sloan [31] and Berridge and Whitehead [52] showed how the CR
model can be fitted using an ordinary binary logistic model likelihood function,
after certain rows of the X matrix are duplicated and a new binary Y
vector is constructed. For each subject, one constructs separate records by
considering successive conditions $Y \geq 0, Y \geq 1, \ldots, Y \geq k - 1$ for a response
variable with values 0, 1, ..., k. The binary response for each applicable condition
or “cohort” is set to 1 if the subject failed at the current “cohort” or
“risk set,” that is, if Y = j where the cohort being considered is $Y \geq j$. The
constructed cohort variable is carried along with the new X and Y. This variable
is considered to be categorical and its coefficients are fitted by adding
k - 1 dummy variables to the binary logistic model. For ease of computation,
the CR model is restated as follows, with the first cohort used as the reference
cell.

$$\Pr(Y = j \mid Y \geq j, X) = \frac{1}{1 + \exp[-(\alpha + \theta_j + X\gamma)]}.$$  (13.11)

Here $\alpha$ is an overall intercept, $\theta_0 \equiv 0$, and $\theta_1, \ldots, \theta_{k-1}$ are increments from $\alpha$.
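The record-duplication step can be sketched in base R. The helper `cr_expand` below is a hypothetical, hand-rolled illustration of the expansion (it assumes Y is coded as integers 0, ..., k with level k observed), and `glm` stands in for the binary logistic fitter; the data are simulated.

```r
# Hypothetical helper: one record per subject per applicable cohort Y >= j
cr_expand <- function(y) {
  k <- max(y)
  rows <- lapply(seq_along(y), function(i) {
    j <- 0:min(y[i], k - 1)   # cohorts this subject belongs to
    data.frame(subs = i, cohort = j, ybin = as.integer(y[i] == j))
  })
  do.call(rbind, rows)
}

set.seed(4)
n <- 300
x <- rnorm(n)
y <- cut(x + rlogis(n), breaks = c(-Inf, -1, 0.5, Inf), labels = FALSE) - 1  # 0, 1, 2

d <- cr_expand(y)
d$x <- x[d$subs]               # duplicate predictor rows as needed
d$cohort <- factor(d$cohort)   # k - 1 dummy variables, first cohort as reference

fit <- glm(ybin ~ cohort + x, family = binomial, data = d)
coef(fit)   # (Intercept) = alpha, cohort1 = theta_1, x = gamma (Equation 13.11)
```

A subject with Y = 0 contributes one record, while a subject at the highest level contributes k records, all with a binary response of 0.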


13.4.4 Residuals

To check CR model assumptions, binary logistic model partial residuals are
again valuable. We separately fit a sequence of binary logistic models using a
series of binary events and the corresponding applicable (increasingly small)
subsets of subjects, and plot smoothed partial residuals against X for all of
the binary events. Parallelism in these plots indicates that the CR model's
constant $\gamma$ assumptions are satisfied.


13.4.5 Assessment of Model Fit

The  partial  residual  plots  just  described  are  very  useful  for  checking  the 
constant  slope  assumption  of  the  CR  model.  The  next  section  shows  how  to 
test  this  assumption  formally.  Linearity  can  be  assessed  visually  using  the 
smoothed  partial  residual  plot,  and  interactions  between  predictors  can  be 
tested  as  usual. 


13.4.6 Extended CR Model

The PO model has been extended by Peterson and Harrell [502] to allow for
unequal slopes for some or all of the Xs for some or all levels of Y. This partial
PO model requires specialized software. The CR model can be extended more
easily. In R notation, the ordinary CR model is specified as

y ~ cohort + X1 + X2 + X3 + ...

with cohort denoting a polytomous variable. The CR model can be extended
to allow for some or all of the $\gamma$s to change with the cohort or Y-cutoff [31].
Suppose that non-constant slope is allowed for X1 and X2. The R notation for
the extended model would be

y ~ cohort*(X1 + X2) + X3

The extended CR model is a discrete version of the Cox survival model with
time-dependent covariables.

There  is  nothing  about  the  CR  model  that  makes  it  fit  a  given  dataset 
better  than  other  ordinal  models  such  as  the  PO  model.  The  real  benefit  of 
the  CR  model  is  that  using  standard  binary  logistic  model  software  one  can 
flexibly  specify  how  the  equal-slopes  assumption  can  be  relaxed. 
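As an illustration of relaxing the equal-slopes assumption with standard software, the sketch below fits the ordinary and an extended CR model with `glm` on simulated, expanded data and compares them with a likelihood ratio test. The data, variable names, and the choice to free only the slope of `x1` are assumptions made for this example.

```r
set.seed(5)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- cut(x1 + 0.5 * x2 + rlogis(n), breaks = c(-Inf, -1, 0.5, Inf),
         labels = FALSE) - 1   # levels 0, 1, 2
k <- max(y)

# Expand to one record per subject per cohort Y >= j, j = 0, ..., k - 1
grid <- expand.grid(subs = seq_len(n), cohort = 0:(k - 1))
grid <- grid[y[grid$subs] >= grid$cohort, ]
d <- data.frame(ybin   = as.integer(y[grid$subs] == grid$cohort),
                cohort = factor(grid$cohort),
                x1 = x1[grid$subs], x2 = x2[grid$subs])

cr_fit  <- glm(ybin ~ cohort + x1 + x2, family = binomial, data = d)  # ordinary CR
ext_fit <- glm(ybin ~ cohort * x1 + x2, family = binomial, data = d)  # slope of x1 varies

# Likelihood ratio test of the constant-slope assumption for x1
anova(cr_fit, ext_fit, test = "Chisq")
```

The interaction terms are the extra parameters of the extended model, so the test has one degree of freedom per freed slope per additional cohort.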


13.4.7 Role of Penalization in Extended CR Model

As demonstrated in the upcoming case study, penalized MLE is invaluable in
allowing the model to be extended into an unequal-slopes model insofar as the
information content in the data will support. Faraway [181] has demonstrated
how all data-driven steps of the modeling process increase the real variance in
“final” parameter estimates, when one estimates variances without assuming
that the final model was prespecified. For ordinal regression modeling, the
most important modeling steps are (1) choice of predictor variables, (2) selecting
or modeling predictor transformations, and (3) allowance for unequal
slopes across Y-cutoffs (i.e., non-PO or non-CR). Regarding Steps (2) and (3)
one is tempted to rely on graphical methods such as residual plots to make
detours in the strategy, but it is very difficult to estimate variances or to
properly penalize assessments of predictive accuracy for subjective modeling
decisions. Regarding (1), shrinkage has been proven to work better than stepwise
variable selection when one is attempting to build a main-effects model.
Choosing a shrinkage factor is a well-defined, smooth, and often a unique
process as opposed to binary decisions on whether variables are “in” or “out”
of the model. Likewise, instead of using arbitrary subjective (residual plots)
or objective ($\chi^2$ due to cohort × covariable interactions, i.e., non-constant
covariable effects) criteria, shrinkage can systematically allow model enhancements
insofar  as  the  information  content  in  the  data  will  support,  through  the  use  of 
differential  penalization.  Shrinkage  is  a  solution  to  the  dilemma  faced  when 
the  analyst  attempts  to  choose  between  a  parsimonious  model  and  a  more 
complex  one  that  fits  the  data.  Penalization  does  not  require  the  analyst  to 
make  a  binary  decision,  and  it  is  a  process  that  can  be  validated  using  the 
bootstrap. 


13.4.8 Validating the Fitted Model

Validation of statistical indexes such as $D_{xy}$ and model calibration is done
using techniques discussed previously, except that certain problems must be
addressed. First, when using the bootstrap, the resampling must take into account
the existence of multiple records per subject that were created to use
the binary logistic likelihood trick. That is, sampling should be done with
replacement from subjects rather than records. Second, the analyst must isolate
which event to predict. This is because when observations are expanded in
order to use a binary logistic likelihood function to fit the CR model, several
different events are being predicted simultaneously. Somers' $D_{xy}$ could be
computed by relating $X\hat{\gamma}$ (ignoring intercepts) to the ordinal Y, but other
indexes are not defined so easily. The simplest approach here would be to
validate a single prediction for $\Pr(Y = j \mid Y \geq j, X)$, for example. The simplest
event to predict is $\Pr(Y = 0 \mid X)$, as this would just require subsetting
on all observations in the first cohort level in the validation sample. It would
also be easy to validate any one of the later conditional probabilities. The
validation functions described in the next section allow for such subsetting,
as well as handling the cluster sampling. Specialized calculations would be
needed to validate an unconditional probability such as $\Pr(Y \geq 2 \mid X)$.
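The two requirements, resampling whole subjects and isolating one event, can be sketched on a toy expanded dataset. The helper `cluster_boot` and the column names `subs`, `cohort`, and `ybin` are hypothetical; this illustrates the sampling idea only and is not the rms implementation.

```r
# Hypothetical helper: draw one bootstrap sample of whole subjects
cluster_boot <- function(d, id = "subs") {
  ids <- unique(d[[id]])
  drawn <- ids[sample.int(length(ids), replace = TRUE)]  # sample subjects, not records
  rows <- unlist(lapply(drawn, function(s) which(d[[id]] == s)))
  d[rows, , drop = FALSE]     # every record of each drawn subject is carried along
}

# Toy expanded dataset: subject 1 has one record, subjects 2 and 3 have two each
d <- data.frame(subs   = c(1, 2, 2, 3, 3),
                cohort = c(0, 0, 1, 0, 1),
                ybin   = c(1, 0, 1, 0, 0))
set.seed(6)
b <- cluster_boot(d)

# Validating Pr(Y = 0 | X) only needs the first-cohort records
b_first <- b[b$cohort == 0, ]
```

Because every subject has exactly one first-cohort record, `b_first` always has as many rows as there are subjects in the original data.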


13.4.9 R Functions

The cr.setup function in rms returns a list of vectors useful in constructing
a dataset used to trick a binary logistic function such as lrm into fitting
CR models. The subs vector in this list contains observation numbers in the
original data, some of which are repeated. Here is an example.

Since the lrm and pentrace functions have the capability to penalize different
parts of the model by different amounts, they are valuable for fitting
extended  CR  models  in  which  the  cohort  x  predictor  interactions  are  allowed 
to  be  only  as  important  as  the  information  content  in  the  data  will  support. 
Simple  main  effects  can  be  unpenalized  or  slightly  penalized  as  desired. 

The validate and calibrate functions for lrm allow specification of subject
identifiers when using the bootstrap, so the samples can be constructed
with replacement from the original subjects. In other words, cluster sampling
is done from the expanded records. This is handled internally by the
predab.resample function. These functions also allow one to specify a subset of
the records to use in the validation, which makes it especially easy to validate
the part of the model used to predict $\Pr(Y = 0 \mid X)$.

The plot.xmean.ordinaly function is useful for checking the CR assumption
for single predictors, as described earlier.

13.5 Further Reading


1. See [5, 25, 26, 31, 32, 52, 63, 64, 113, 126, 240, 245, 276, 354, 449, 502, 561, 664, 679]
for some excellent background references, applications, and extensions to the
ordinal models. [663] and [428] demonstrate how to model ordinal outcomes with
repeated measurements within subject using random effects in Bayesian models.
The first to develop an ordinal regression model were Aitchison and Silvey [8].

2. Some analysts feel that combining categories improves the performance of test
statistics when fitting PO models when sample sizes are small and cells are
sparse. Murad et al. [469] demonstrated that this causes more problems, because
it results in overly conservative Wald tests.

3. Anderson and Philips [26, p. 29] proposed methods for constructing properly
spaced response values given a fitted PO model.

4. The simplest demonstration of this is to consider a model in which there is a
single predictor that is totally independent of a nine-level response Y, so PO
must hold. A PO model is fitted in SAS using:

DATA test;
DO i=1 TO 50;
  y=FLOOR(RANUNI(151)*9);
  x=RANNOR(5);
  OUTPUT;
END;
PROC LOGISTIC; MODEL y=x;

The score test for PO was $\chi^2 = 56$ on 7 d.f., P < 0.0001. This problem results
from some small cell sizes in the distribution of Y [502]. The P-value for testing
the regression effect for X was 0.76.

5. The R glmnetcr package by Kellie Archer provides a different way to fit
continuation ratio models.

6. Bender and Benner [48] have some examples using the precursor of the rms package
for fitting and assessing the goodness of fit of ordinal logistic regression models.


13.6  Problems 

Test for the association between disease group and total hospital cost in
SUPPORT, without imputing any missing costs (exclude the one patient
having zero cost).

1.  Use  the  Kruskal-Wallis  rank  test. 

2. Use the proportional odds ordinal logistic model generalization of the
Wilcoxon-Mann-Whitney Kruskal-Wallis Spearman test. Group total cost
into 20 quantile groups so that only 19 intercepts will need to be in the
model, not one less than the number of subjects (this would have taken
the program too long to fit the model). Use the likelihood ratio χ² for this
and later steps.

3.  Use  a  binary  logistic  model  to  test  for  association  between  disease  group 
and  whether  total  cost  exceeds  the  median  of  total  cost.  In  other  words, 
group  total  cost  into  two  quantile  groups  and  use  this  binary  variable  as 
the  response.  What  is  wrong  with  this  approach? 

4. Instead of using only two cost groups, group cost into 3, 4, 5, 6, 8, 10,
and 12 quantile groups. Describe the relationship between the number of
intervals used to approximate the continuous response variable and the
efficiency of the analysis. How many intervals of total cost, assuming that
the ordering of the different intervals is used in the analysis, are required
to avoid losing significant information in this continuous variable?

5.  If  you  were  selecting  one  of  the  rank-based  tests  for  testing  the  association 
between  disease  and  cost,  which  of  any  of  the  tests  considered  would  you 
choose? 

6.  Why  do  all  of  the  tests  you  did  have  the  same  number  of  degrees  of  freedom 
for  the  hypothesis  of  no  association  between  dzgroup  and  totcst? 

7.  What  is  the  advantage  of  a  rank-based  test  over  a  parametric  test  based 
on  log(cost)? 

8. Show that for a two-sample problem, the numerator of the score test for
comparing the two groups using a proportional odds model is exactly the
numerator of the Wilcoxon-Mann-Whitney two-sample rank-sum test.


Chapter  14 

Case  Study  in  Ordinal  Regression, 
Data  Reduction,  and  Penalization 


This  case  study  is  taken  from  Harrell  et  al.  2  which  described  a  World  Health 
Organization  study43  in  which  vital  signs  and  a  large  number  of  clinical 
signs  and  symptoms  were  used  to  develop  a  predictive  model  for  an  ordinal 
response.  This  response  consists  of  laboratory  assessments  of  diagnosis  and 
severity  of  illness  related  to  pneumonia,  meningitis,  and  sepsis.  Much  of  the 
modeling strategy given in Chapter 4 was used to develop the model, with
additional emphasis on penalized maximum likelihood estimation (Section 9.10).
The following laboratory data are used in the response: cerebrospinal fluid
(CSF) culture from a lumbar puncture (LP), blood culture (BC), arterial
oxygen saturation (SaO2, a measure of lung dysfunction), and chest X-ray
(CXR). The sample consisted of 4552 infants aged 90 days or less.

This  case  study  covers  these  topics: 

1.  definition  of  the  ordinal  response  (Section  14.1); 

2.  scoring  and  clustering  of  clinical  signs  (Section  14.2); 

3.  testing  adequacy  of  weights  specified  by  subject-matter  specialists  and 
assessing  the  utility  of  various  scoring  schemes  using  a  tentative  ordinal 
logistic  model  (Section  14.3); 

4. assessing the basic ordinality assumptions and examining the proportional
odds and continuation ratio (PO and CR) assumptions separately
for each predictor (Section 14.4);

5. deriving a tentative PO model using cluster scores and regression splines
(Section 14.5);

6. using residual plots to check PO, CR, and linearity assumptions
(Section 14.6);

7.  examining  the  fit  of  a  CR  model  (Section  14.7); 

8.  utilizing  an  extended  CR  model  to  allow  some  or  all  of  the  regression 
coefficients  to  vary  with  cutoffs  of  the  response  level  as  well  as  to  provide 
formal  tests  of  constant  slopes  (Section  14.8); 

9. using penalized maximum likelihood estimation to improve accuracy
(Section 14.9);

10. approximating the full model by a sub-model and drawing a nomogram
on the basis of the sub-model (Section 14.10); and

11. validating the ordinal model using the bootstrap (Section 14.11).


Table 14.1 Ordinal Outcome Scale

Outcome                                        Fraction in Outcome Level
Level    Definition                     n      BC, CXR      Not          Random
Y                                              Indicated    Indicated    Sample
                                               (n = 2398)   (n = 1979)   (n = 175)
0        None of the below            3551     0.63         0.96         0.91
1        90% ≤ SaO2 < 95% or CXR+      490     0.17         0.04 (a)     0.05
2        BC+ or CSF+ or SaO2 < 90%     511     0.21         0.00 (b)     0.03

(a) SaO2 was measured but CXR was not done.
(b) Assumed zero since neither BC nor LP were done.


14.1  Response  Variable 

To be a candidate for BC and CXR, an infant had to have a clinical indication
for one of the three diseases, according to prespecified criteria in the study
protocol (n = 2398). Blood work-up (but not necessarily LP) and CXR were
also done on a random sample intended to be 10% of infants having no signs
or symptoms suggestive of infection (n = 175). Infants with signs suggestive
of meningitis had LP done. All 4552 infants received a full physical exam and
standardized pulse oximetry to measure SaO2. The vast majority of infants
getting CXR had the X-rays interpreted by three independent radiologists.

The  analyses  that  follow  are  not  corrected  for  verification  bias68/  with 
respect  to  BC,  LP,  and  CXR,  but  Section  14.1  has  some  data  describing  the 
extent  of  the  problem,  and  the  problem  is  reduced  by  conditioning  on  a  large 
number  of  covariates. 

Patients were assigned to the worst qualifying outcome category. Table 14.1
shows the definition of the ordinal outcome variable Y and shows the
distribution of Y by the lab work-up strategy.

The effect of verification bias is a false negative fraction of 0.03 for Y = 2,
from comparing the detection fraction of zero for Y = 2 in the "Not Indicated"
group with the observed positive fraction of 0.03 in the random sample that
was fully worked up. The extent of verification bias in Y = 1 is 0.05 − 0.04 =
0.01. These biases are ignored in this analysis.
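The arithmetic behind these two bias estimates can be written out as a short R sketch. The fractions are copied from Table 14.1; the object names (frac, bias) are illustrative only and are not part of the original analysis.

```r
# Fractions of infants in each outcome level, copied from Table 14.1
frac <- rbind(
  "Not indicated" = c(Y0 = 0.96, Y1 = 0.04, Y2 = 0.00),
  "Random sample" = c(Y0 = 0.91, Y1 = 0.05, Y2 = 0.03)
)

# Verification bias: fraction found in the fully worked-up random sample
# minus the fraction detected when the work-up was not indicated
bias <- frac["Random sample", ] - frac["Not indicated", ]
print(round(bias[c("Y1", "Y2")], 2))
```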


14.2 Variable Clustering

Forty-seven  clinical  signs  were  collected  for  each  infant.  Most  questionnaire 
items  were  scored  as  a  single  variable  using  equally  spaced  codes,  with  0  to 
3  representing,  for  example,  sign  not  present,  mild,  moderate,  severe.  The 
resulting  list  of  clinical  signs  with  their  abbreviations  is  given  in  Table  14.2. 
The  signs  are  organized  into  clusters  as  discussed  later. 


Table 14.2 Clinical Signs

Cluster Name   Sign           Name of Sign                       Values
               Abbreviation
bul.conv       abb            bulging fontanel                   0-1
               convul         hx convulsion                      0-1
hydration      abk            sunken fontanel                    0-1
               hdi            hx diarrhoea                       0-1
               deh            dehydrated                         0-2
               stu            skin turgor                        0-2
               dcp            digital capillary refill           0-2
drowsy         hcl            less activity                      0-1
               qcr            quality of crying                  0-2
               csd            drowsy state                       0-2
               slpm           sleeping more                      0-1
               wake           wakes less easily                  0-1
               aro            arousal                            0-2
               mvm            amount of movement                 0-2
agitated       hcm            crying more                        0-1
               slpl           sleeping less                      0-1
               con            consolability                      0-2
               csa            agitated state                     0-1
crying         hcm            crying more                        0-1
               hcs            crying less                        0-1
               qcr            quality of crying                  0-2
               smi2           smiling ability × age > 42 days    0-2
reffort        nfl            nasal flaring                      0-3
               lcw            lower chest in-drawing             0-3
               gru            grunting                           0-2
               ccy            central cyanosis                   0-1
stop.breath    hap            hx stop breathing                  0-1
               apn            apnea                              0-1
ausc           whz            wheezing                           0-1
               coh            cough heard                        0-1
               crs            crepitation                        0-2
hxprob         hfb            fast breathing                     0-1
               hdb            difficulty breathing               0-1
               hlt            mother report resp. problems       none, chest, other
feeding        hfa            hx abnormal feeding                0-3
               absu           sucking ability                    0-2
               afe            drinking ability                   0-2
labor          chi            previous child died                0-1
               fde            fever at delivery                  0-1
               ldy            days in labor                      1-9
               twb            water broke                        0-1
abdominal      abd            abdominal distension               0-4
               jau            jaundice                           0-1
               omph           omphalitis                         0-1
fever.ill      illd           age-adjusted no. days ill
               hfe            hx fever                           0-1
pustular       conj           conjunctivitis                     0-1
               oto            otoscopy impression                0-2
               puskin         pustular skin rash                 0-1

Fig. 14.1 Hierarchical variable clustering using Spearman ρ² as a similarity measure
for all pairs of variables. Note that since the hlt variable was nominal, it is represented
by two dummy variables here.


Here, hx stands for history, ausc for auscultation, and hxprob for history of
problems. Two signs (qcr, hcm) were listed twice since they were later placed
into two clusters each.

Next, hierarchical clustering was done using the matrix of squared Spearman
rank correlation coefficients as the similarity matrix. The varclus R
function was used as follows.

require(rms)
getHdata(ari)  # defines ari, Sc, Y, Y.death
vclust <-
  varclus(~ illd + hlt + slpm + slpl + wake + convul + hfa +
            hfb + hfe + hap + hcl + hcm + hcs + hdi +
            fde + chi + twb + ldy + apn + lcw + nfl +
            str + gru + coh + ccy + jau + omph + csd +
            csa + aro + qcr + con + att + mvm + afe +
            absu + stu + deh + dcp + crs + abb + abk +
            whz + hdb + smi2 + abd + conj + oto + puskin,
          data=ari)
plot(vclust)  # Figure 14.1

The output appears in Figure 14.1. This output served as a starting point
for clinicians to use in constructing more meaningful clinical clusters. The
clusters in Table 14.2 were the consensus of the clinicians who were the
investigators in the WHO study. Prior subject matter knowledge plays a key
role at this stage in the analysis.
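The idea behind clustering on squared Spearman correlations can be sketched with base R alone. This is an illustration on simulated variables (x1 to x4 are hypothetical, not signs from the ari data), and it assumes complete-linkage clustering on 1 − ρ² as the dissimilarity; it is not a substitute for varclus.

```r
# Simulate two pairs of strongly related variables
set.seed(5)
n <- 300
a <- rnorm(n)
b <- rnorm(n)
signs <- data.frame(
  x1 = a + rnorm(n, sd = 0.3),
  x2 = a + rnorm(n, sd = 0.3),
  x3 = b + rnorm(n, sd = 0.3),
  x4 = b + rnorm(n, sd = 0.3)
)

# Similarity = squared Spearman rank correlation; dissimilarity = 1 - rho^2
rho2 <- cor(signs, method = "spearman")^2
hc   <- hclust(as.dist(1 - rho2), method = "complete")

# Cut the tree into two clusters and show the membership of each variable
groups <- cutree(hc, k = 2)
print(groups)
# plot(hc) would draw a dendrogram analogous in spirit to Figure 14.1
```

Squaring the correlation makes strongly negative and strongly positive associations count equally as similarity, which is why signs that move in opposite directions can still land in the same cluster.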


14.3  Developing  Cluster  Summary  Scores 

The clusters listed in Table 14.2 were first scored by the first principal
component of transcan-transformed signs, denoted by PC1. Knowing that the
resulting weights may be too complex for clinical use, the primary reason
for analyzing the principal components was to see if some of the clusters
could be removed from consideration so that the clinicians would not spend
time developing scoring rules for them. Let us "peek" at Y to assist in scoring
clusters at this point, but to do so in a very structured way that does not
involve the examination of a large number of individual coefficients.


Table 14.3 Clinician Combinations, Rankings, and Scorings of Signs

Cluster     Combined/Ranked Signs in Order of Severity             Weights
bul.conv    abb ∪ convul                                           0-1
drowsy      hcl, qcr>0, csd>0 ∪ slpm ∪ wake, aro>0, mvm>0          0-5
agitated    hcm, slpl, con=1, csa, con=2                           0, 1, 2, 7, 8, 10
reffort     nfl>0, lcw>1, gru=1, gru=2, ccy                        0-5
ausc        whz, coh, crs>0                                        0-3
feeding     hfa=1, hfa=2, hfa=3, absu=1 ∪ afe=1, absu=2 ∪ afe=2    0-5
abdominal   jau ∪ abd>0 ∪ omph                                     0-1


To judge any cluster scoring scheme, we must pick a tentative outcome
model. For this purpose we chose the PO model. By using the 14 PC1s
corresponding to the 14 clusters, the fitted PO model had a likelihood ratio
(LR) χ² of 1155 with 14 d.f., and the predictive discrimination of the clusters
was quantified by a Somers' Dxy rank correlation between Xβ̂ and Y
of 0.596. The following clusters were not statistically important predictors
and we assumed that the lack of importance of the PC1s in predicting Y
(adjusted for the other PC1s) justified a conclusion that no sign within that
cluster was clinically important in predicting Y: hydration, hxprob, pustular,
crying, fever.ill, stop.breath, labor. This list was identified using a backward
step-down procedure on the full model. The total Wald χ² for these
seven PC1s was 22.4 (P = 0.002). The reduced model had LR χ² = 1133
with 7 d.f., Dxy = 0.591. The bootstrap validation in Section 14.11 penalizes
for examining all candidate predictors.
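Scoring a cluster by its first principal component can be sketched in base R as follows. This is a simplified illustration: the three signs are simulated, the names (severity, s1 to s3) are hypothetical, and the signs are standardized directly rather than being transformed by transcan first as in the analysis described above.

```r
# Simulate three ordinal signs driven by one latent severity variable
set.seed(2)
n <- 200
severity <- rnorm(n)                                   # latent, simulated
signs <- data.frame(
  s1 = as.integer(severity + rnorm(n) > 0.5),          # 0-1 sign
  s2 = findInterval(severity + rnorm(n), c(0, 1)),     # 0-2 sign
  s3 = findInterval(severity + rnorm(n), c(-0.5, 1))   # 0-2 sign
)

# First principal component of the standardized signs = cluster score
pc  <- prcomp(signs, center = TRUE, scale. = TRUE)
pc1 <- pc$x[, 1]

# The loadings are the weights given to each sign; note there is one per sign
print(round(pc$rotation[, 1], 2))
```

The sign of a principal component is arbitrary, so a PC1 score may need to be negated before it is read as "higher means more severe".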

The  clinicians  were  asked  to  rank  the  clinical  severity  of  signs  within  each 
potentially  important  cluster.  During  this  step,  the  clinicians  also  ranked 
severity  levels  of  some  of  the  component  signs,  and  some  cluster  scores  were 
simplified,  especially  when  the  signs  within  a  cluster  occurred  infrequently. 
The  clinicians  also  assessed  whether  the  severity  points  or  weights  should  be 
equally  spaced,  assigning  unequally  spaced  weights  for  one  cluster  (agitated). 
The  resulting  rankings  and  sign  combinations  are  shown  in  Table  14.3.  The 
signs  or  sign  combinations  separated  by  a  comma  are  treated  as  separate 
categories,  whereas  some  signs  were  unioned  (“or”-ed)  when  the  clinicians 
deemed  them  equally  important.  As  an  example,  if  an  additive  cluster  score 
was  to  be  used  for  drowsy,  the  scorings  would  be  0  =  none  present,  1  =  hcl, 
2  =  qcr>0,  3  =  csd>0  or  slpm  or  wake,  4  =  aro>0,  5  =  mvm>0  and  the  scores 
would  be  added. 

Table 14.4 Predictive information of various cluster scoring strategies. AIC is on
the likelihood ratio χ² scale.

Scoring Method                      LR χ²   d.f.   AIC
PC1 of each cluster                 1133     7     1119
Union of all signs                  1045     7     1031
Union of higher categories          1123     7     1109
Hierarchical (worst sign)           1194     7     1180
Additive, equal weights             1155     7     1141
Additive using clinician weights    1183     7     1169
Hierarchical, data-driven weights   1227    25     1177


Table 14.3 reflects some data reduction already (unioning some signs and
selection of levels of ordinal signs) but more reduction is needed. Even after
signs are ranked within a cluster, there are various ways of assigning the
cluster scores. We investigated seven methods. We started with the purely
statistical approach of using PC1 to summarize each cluster. Second, all sign
combinations within a cluster were unioned to represent a 0/1 cluster score.
Third, only sign combinations thought by the clinicians to be severe were
unioned, resulting in drowsy=aro>0 or mvm>0, agitated=csa or con=2,
reffort=lcw>1 or gru>0 or ccy, ausc=crs>0, and feeding=absu>0 or afe>0.
For clusters that are not scored 0/1 in Table 14.3, the fourth summarization
method was a hierarchical one that used the weight of the worst applicable
category as the cluster score. For example, if aro=1 but mvm=0, drowsy would
be scored as 4. The fifth method counted the number of positive signs in the
cluster. The sixth method summed the weights of all signs or sign combinations
present. Finally, the worst sign combination present was again used as in the
fourth method, but the points assigned to the category were data-driven ones
obtained by using extra dummy variables. This provided an assessment of
the adequacy of the clinician-specified weights. By comparing rows 4 and 7
in Table 14.4 we see that response data-driven sign weights have a slightly
worse AIC (1177 versus 1180), indicating that the number of extra β parameters
estimated was not justified by the improvement in χ². The hierarchical method,
using the clinicians' weights, performed quite well. The only cluster with
inadequate clinician weights was ausc (see below). The PC1 method, without
any guidance, performed well, as in [268]. The only reasons not to use it are
that it requires a coefficient for every sign in the cluster and the coefficients
are not translatable into simple scores such as 0, 1, ....
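The AIC column of Table 14.4 can be reproduced from the other two columns, since AIC on the likelihood ratio χ² scale is the LR χ² minus twice the degrees of freedom. The following R sketch only recomputes numbers already in the table; the data frame name scoring is illustrative.

```r
# LR chi-square and d.f. for each scoring method, copied from Table 14.4
scoring <- data.frame(
  method = c("PC1 of each cluster", "Union of all signs",
             "Union of higher categories", "Hierarchical (worst sign)",
             "Additive, equal weights", "Additive using clinician weights",
             "Hierarchical, data-driven weights"),
  lr.chisq = c(1133, 1045, 1123, 1194, 1155, 1183, 1227),
  df       = c(7, 7, 7, 7, 7, 7, 25)
)

# AIC on the chi-square scale: larger is better
scoring$aic <- scoring$lr.chisq - 2 * scoring$df
print(scoring[order(-scoring$aic), ])
```

The 18 extra parameters of the data-driven weights cost 36 AIC points but gain only 33 in χ², which is why row 7 ends up slightly below row 4.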

Representation of clusters by a simple union of selected signs or of all signs
is inadequate, but otherwise the choice of methods is not very important in
terms of explaining variation in Y. We chose the fourth method, a hierarchical
severity point assignment (using weights that were prespecified by the
clinicians), for its ease of use and of handling missing component variables
(in most cases) and potential for speeding up the clinical exam (examining
to detect more important signs first). Because of what was learned regarding
the relationship between ausc and Y, we modified the ausc cluster score
by redefining it as ausc=crs>0 (crepitations present). Note that neither the
"tweaking" of ausc nor the examination of the seven scoring methods
displayed in Table 14.4 is taken into account in the model validation.
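The difference between the hierarchical and the additive scoring rules can be made concrete with the drowsy cluster. The sketch below uses the weights 1 to 5 from the drowsy example in the text; the function score.cluster and the category labels are hypothetical helpers written for this illustration, not part of the original analysis.

```r
# Clinician weights for the ranked drowsy categories (Table 14.3):
# hcl < qcr>0 < (csd>0 or slpm or wake) < aro>0 < mvm>0
drowsy.weights <- c(hcl = 1, qcr = 2, csd.slpm.wake = 3, aro = 4, mvm = 5)

# Score one infant's cluster from a logical vector of categories present
score.cluster <- function(present, weights,
                          method = c("hierarchical", "additive")) {
  method <- match.arg(method)
  w <- weights[present]            # weights of the categories present
  if (length(w) == 0) return(0)    # no sign present
  if (method == "hierarchical") max(w) else sum(w)
}

# An infant with less activity (hcl) and abnormal arousal (aro>0) only
present <- c(hcl = TRUE, qcr = FALSE, csd.slpm.wake = FALSE,
             aro = TRUE, mvm = FALSE)
print(score.cluster(present, drowsy.weights, "hierarchical"))
print(score.cluster(present, drowsy.weights, "additive"))
```

Under the hierarchical rule the examiner can stop as soon as the worst applicable category is found, which is the source of the potential speed-up of the clinical exam mentioned above.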


14.4 Assessing Ordinality of Y for each X,
and Unadjusted Checking of PO and CR Assumptions


Section 13.2 described a graphical method for assessing the ordinality
assumption for Y separately with respect to each X, and for assessing PO and
CR assumptions individually. Figure 14.2 is an example of such displays. For
this dataset we expect strongly nonlinear effects for temp, rr, and hrat, so for
those predictors we plot the mean absolute differences from suitable "normal"
values as an approximate solution.

Sc <- transform(Sc,
                ausc      = 1 * (ausc == 3),
                bul.conv  = 1 * (bul.conv == 'TRUE'),
                abdominal = 1 * (abdominal == 'TRUE'))
plot.xmean.ordinaly(Y ~ age + abs(temp-37) + abs(rr-60) +
                    abs(hrat-125) + waz + bul.conv + drowsy +
                    agitated + reffort + ausc + feeding +
                    abdominal, data=Sc, cr=TRUE,
                    subn=FALSE, cex.points=.65)  # Figure 14.2


The plot is shown in Figure 14.2. Y does not seem to operate in an ordinal
fashion with respect to age, |rr − 60|, or ausc. For the other variables,
ordinality holds, and PO holds reasonably well. For heart rate,
the PO assumption appears to be satisfied perfectly. CR model assumptions
appear to be more tenuous than PO assumptions, when one variable at a
time is fitted.
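The core of the ordinality check, the mean of X stratified by Y and whether it moves monotonically, can be sketched in base R. This is a simplified illustration on simulated data: x.ord and x.non are hypothetical predictors, and the model-based expected values (the dashed PO lines and the CR points of Figure 14.2) are not computed here.

```r
# Simulate a three-level response and two predictors
set.seed(3)
n <- 300
y <- sample(0:2, n, replace = TRUE, prob = c(0.6, 0.2, 0.2))
x.ord <- rnorm(n, mean = y)                    # mean rises with Y
x.non <- rnorm(n, mean = c(0, 1, 0)[y + 1])    # mean peaks at Y = 1

# Simple stratified means of each X by level of Y (the solid lines)
means.ord <- tapply(x.ord, y, mean)
means.non <- tapply(x.non, y, mean)
print(round(rbind(x.ord = means.ord, x.non = means.non), 2))

# Ordinality of Y with respect to X requires a monotonic trend in the means
is.monotone <- function(m) all(diff(m) >= 0) || all(diff(m) <= 0)
print(c(x.ord = is.monotone(means.ord), x.non = is.monotone(means.non)))
```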


14.5  A  Tentative  Full  Proportional  Odds  Model 

Based  on  what  was  determined  in  Section  14.3,  the  original  list  of  47  signs 
was  reduced  to  seven  predictors:  two  unions  of  signs  (bul.conv,  abdominal), 
one  single  sign  (ausc),  and  four  “worst  category”  point  assignments  (drowsy, 
agitated,  reffort,  feeding).  Seven  clusters  were  dropped  for  the  time  being 
because  of  weak  associations  with  Y .  Such  a  limited  use  of  variable  selection 
reduces  the  severe  problems  inherent  with  that  technique. 

Fig. 14.2 Examination of the ordinality of Y for each predictor by assessing how
varying Y relate to the mean X, and whether the trend is monotonic. Solid lines
connect the simple stratified means, and dashed lines connect the estimated expected
value of X | Y = j given that PO holds. Estimated expected values from the CR model
are marked with Cs.


At this point in model development we add to the model age and vital signs:
temp (temperature), rr (respiratory rate), hrat (heart rate), and waz,
weight-for-age Z-score. Since age was expected to modify the interpretation of
temp, rr, and hrat, and interactions between continuous variables would be
difficult to use in the field, we categorized age into three intervals: 0-6 days
(n = 302), 7-59 days (n = 3042), and 60-90 days (n = 1208) (see footnote a).

Sc$ageg <- cut2(Sc$age, c(7, 60))
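For readers without the Hmisc package, a base R sketch of the same three-interval grouping follows. It assumes ages are whole days between 0 and 90; the vector age is a small made-up example, and cut is used here only as an approximation to what cut2 does with these cutpoints.

```r
# Example ages in days, chosen to sit on and around the cutpoints
age <- c(0, 6, 7, 30, 59, 60, 90)

# Left-closed intervals [0,7), [7,60), [60,90]; include.lowest = TRUE with
# right = FALSE closes the last interval so that age 90 is retained
ageg <- cut(age, breaks = c(0, 7, 60, 90),
            right = FALSE, include.lowest = TRUE)
print(table(ageg))
```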

The new variables temp, rr, hrat, waz were missing in, respectively, n =
13, 11, 147, and 20 infants. Since the three vital sign variables are somewhat
correlated with each other, customized single imputation models were
developed to impute all the missing values without assuming linearity or even
monotonicity of any of the regressions.


a  These  age  intervals  were  also  found  to  adequately  capture  most  of  the  interaction 
effects. 

vsign.trans <- transcan(~ temp + hrat + rr, data=Sc,
                        imputed=TRUE, pl=FALSE)

Convergence criterion:2.222 0.643 0.191 0.056 0.016
Convergence in 6 iterations
R² achieved in predicting each variable:

 temp  hrat    rr
0.168 0.160 0.066

Adjusted R²:

 temp  hrat    rr
0.167 0.159 0.064

Sc <- transform(Sc,
                temp = impute(vsign.trans, temp),
                hrat = impute(vsign.trans, hrat),
                rr   = impute(vsign.trans, rr))

After transcan estimated optimal restricted cubic spline transformations, temp
could be predicted with adjusted R² = 0.17 from hrat and rr, hrat could be
predicted with adjusted R² = 0.16 from temp and rr, and rr could be predicted
with adjusted R² of only 0.06. The first two R², while not large, mean
that customized imputations are more efficient than imputing with constants.
Imputations on rr were closer to the median rr of 48/minute as compared
with the other two vital signs whose imputations have more variation. In a
similar manner, waz was imputed using age, birth weight, head circumference,
body length, and prematurity (adjusted R² for predicting waz from the others
was 0.74). The continuous predictors temp, hrat, rr were not assumed to
linearly relate to the log odds that Y ≥ j. Restricted cubic spline functions
with five knots for temp, rr and four knots for hrat, waz were used to model
the effects of these variables:

f1 <- lrm(Y ~ ageg*(rcs(temp,5) + rcs(rr,5) + rcs(hrat,4)) +
          rcs(waz,4) + bul.conv + drowsy + agitated +
          reffort + ausc + feeding + abdominal,
          data=Sc, x=TRUE, y=TRUE)
# x=TRUE, y=TRUE used by resid() below
print(f1, latex=TRUE, coefs=5)


Logistic Regression Model

lrm(formula = Y ~ ageg * (rcs(temp, 5) + rcs(rr, 5) + rcs(hrat,
    4)) + rcs(waz, 4) + bul.conv + drowsy + agitated + reffort +
    ausc + feeding + abdominal, data = Sc, x = TRUE, y = TRUE)

                     Model Likelihood      Discrimination     Rank Discrim.
                        Ratio Test            Indexes            Indexes
Obs          4552    LR χ²      1393.18    R²      0.355     C      0.826
 0           3551    d.f.            45    g       1.485     Dxy    0.653
 1            490    Pr(> χ²)  < 0.0001    gr      4.414     γ      0.654
 2            511                          gp      0.225     τa     0.240
max |∂ log L/∂β| 2×10⁻⁶                    Brier   0.120

                  Coef      S.E.   Wald Z   Pr(>|Z|)
y≥1             0.0653    7.6563     0.01     0.9932
y≥2            -1.0646    7.6563    -0.14     0.8894
ageg=[ 7,60)    9.5590    9.9071     0.96     0.3346
ageg=[60,90]   29.1376   15.8915     1.83     0.0667
temp           -0.0694    0.2160    -0.32     0.7480


Wald  tests  of  nonlinearity  and  interaction  are  shown  in  Table  14.5. 

latex(anova(f1), file='', label='ordinal-anova.f1',
      caption='Wald statistics from the proportional odds model',
      size='smaller')  # Table 14.5

The bottom four lines of the table are the most important. First, there is
strong evidence that some associations with Y exist (45 d.f. test) and very
strong evidence of nonlinearity in one of the vital signs or in waz (26 d.f. test).
There is moderately strong evidence for an interaction effect somewhere in the
model (22 d.f. test). We see that the grouped age variable ageg is predictive
of Y, but mainly as an effect modifier for rr and hrat. temp is extremely
nonlinear, and rr is moderately so. hrat, a difficult variable to measure reliably
in young infants, is perhaps not important enough (χ² = 19, 9 d.f.) to keep
in the final model.
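The P-values in Table 14.5 are upper-tail χ² probabilities of the Wald statistics. As a quick check, the following R lines recompute two of them from the statistics and degrees of freedom printed in the table; the object names are illustrative.

```r
# Combined Wald test for hrat (main effect plus interactions with ageg)
p.hrat <- pchisq(19.00, df = 9, lower.tail = FALSE)

# Pooled test of all interaction terms in the model
p.interaction <- pchisq(40.48, df = 22, lower.tail = FALSE)

print(round(c(hrat = p.hrat, interaction = p.interaction), 4))
```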


14.6  Residual  Plots 

Section  13.3.4  defined  binary  logistic  score  residuals  for  isolating  the  PO 
assumption  in  an  ordinal  model.  For  the  tentative  PO  model,  score  residuals 
for  four  of  the  variables  were  plotted  using 

resid(f1, 'score.binary', pl=TRUE, which=c(17,18,20,21))
## Figure 14.3

The  result  is  shown  in  Figure  14.3.  We  see  strong  evidence  of  non-PO  for 
ausc  and  moderate  evidence  for  drowsy  and  bul.conv,  in  agreement  with 
Figure  14.2. 


Table 14.5 Wald statistics from the proportional odds model

                                                  χ²     d.f.   P
ageg (Factor+Higher Order Factors)               41.49    24    0.0147
  All Interactions                               40.48    22    0.0095
temp (Factor+Higher Order Factors)               37.08    12    0.0002
  All Interactions                                6.77     8    0.5617
  Nonlinear (Factor+Higher Order Factors)        31.08     9    0.0003
rr (Factor+Higher Order Factors)                 81.16    12    < 0.0001
  All Interactions                               27.37     8    0.0006
  Nonlinear (Factor+Higher Order Factors)        27.36     9    0.0012
hrat (Factor+Higher Order Factors)               19.00     9    0.0252
  All Interactions                                8.83     6    0.1836
  Nonlinear (Factor+Higher Order Factors)         7.35     6    0.2901
waz                                              35.82     3    < 0.0001
  Nonlinear                                      13.21     2    0.0014
bul.conv                                         12.16     1    0.0005
drowsy                                           17.79     1    < 0.0001
agitated                                          8.25     1    0.0041
reffort                                          63.39     1    < 0.0001
ausc                                            105.82     1    < 0.0001
feeding                                          30.38     1    < 0.0001
abdominal                                         0.74     1    0.3895
ageg × temp (Factor+Higher Order Factors)         6.77     8    0.5617
  Nonlinear                                       6.40     6    0.3801
  Nonlinear Interaction : f(A,B) vs. AB           6.40     6    0.3801
ageg × rr (Factor+Higher Order Factors)          27.37     8    0.0006
  Nonlinear                                      14.85     6    0.0214
  Nonlinear Interaction : f(A,B) vs. AB          14.85     6    0.0214
ageg × hrat (Factor+Higher Order Factors)         8.83     6    0.1836
  Nonlinear                                       2.42     4    0.6587
  Nonlinear Interaction : f(A,B) vs. AB           2.42     4    0.6587
TOTAL NONLINEAR                                  78.20    26    < 0.0001
TOTAL INTERACTION                                40.48    22    0.0095
TOTAL NONLINEAR + INTERACTION                    96.31    32    < 0.0001
TOTAL                                          1073.78    45    < 0.0001


Partial residuals computed separately for each Y-cutoff (Section 13.3.4) are
the most useful residuals for ordinal models as they simultaneously check
linearity, find needed transformations, and check PO. In Figure 14.4, smoothed
partial residual plots were obtained for all predictors, after first fitting a
simple model in which every predictor was assumed to operate linearly.
Interactions were temporarily ignored and age was used as a continuous variable.


f2 <- lrm(Y ~ age + temp + rr + hrat + waz +
          bul.conv + drowsy + agitated + reffort + ausc +
          feeding + abdominal, data=Sc, x=TRUE, y=TRUE)
resid(f2, 'partial', pl=TRUE, label.curves=FALSE)  # Figure 14.4

Fig. 14.3 Binary logistic model score residuals for binary events derived from two
cutoffs of the ordinal response Y. Note that the mean residuals, marked with closed
circles, correspond closely to differences between solid and dashed lines at Y = 1, 2
in Figure 14.2. Score residual assessments for spline-expanded variables such as rr
would have required one plot per d.f.


The degree of non-parallelism generally agreed with the degree of non-flatness
in Figure 14.3 and with the other score residual plots that were not shown.
The partial residuals show that temp is highly nonlinear and that it is much
more useful in predicting Y = 2. For the cluster scores, the linearity
assumption appears reasonable, except possibly for drowsy. Other nonlinear
effects are taken into account using splines as before (except for age, which is
categorized).

A model can have significant lack of fit with respect to some of the predictors
and still yield quite accurate predictions. To see if that is the case for this
PO model, we computed predicted probabilities of Y = 2 for all infants from
the model and compared these with predictions from a customized binary
logistic model derived to predict Pr(Y = 2). The mean absolute difference
in predicted probabilities between the two models is only 0.02, but the 0.90
quantile of that difference is 0.059. For high-risk infants, discrepancies of 0.2
were common. Therefore we elected to consider a different model.


14.7  Graphical  Assessment  of  Fit  of  CR  Model 

In order to take a first look at the fit of a CR model, let us consider the
two binary events that need to be predicted, and assess linearity and
parallelism over Y-cutoffs. Here we fit a sequence of binary fits and then use the
plot.lrm.partial function, which assembles partial residuals for a sequence
of fits and constructs one graph per predictor.

cr0 <- lrm(Y==0 ~ age + temp + rr + hrat + waz +
           bul.conv + drowsy + agitated + reffort + ausc +
           feeding + abdominal, data=Sc, x=TRUE, y=TRUE)
# Use the update function to save repeating model right-
# hand side.  An indicator variable for Y=1 is the
# response variable below
cr1 <- update(cr0, Y==1 ~ ., subset=Y >= 1)
plot.lrm.partial(cr0, cr1, center=TRUE)  # Figure 14.5

[Figure 14.4: one smoothed partial residual panel per predictor; surviving
panel titles are age, temp, rr, hrat, reffort, ausc, feeding, abdominal]

Fig. 14.4 Smoothed partial residuals corresponding to two cutoffs of Y, from a model
in which all predictors were assumed to operate linearly and additively. The smoothed
curves estimate the actual predictor transformations needed, and parallelism relates
to the PO assumption. Solid lines denote Y ≥ 1 while dashed lines denote Y ≥ 2.

Fig. 14.5 loess smoothed partial residual plots for binary models that are components
of an ordinal continuation ratio model. Solid lines correspond to a model for
Y = 0, and dotted lines correspond to a model for Y = 1 | Y ≥ 1.


The output is in Figure 14.5. There is not much more parallelism here than
in Figure 14.4. For the two most important predictors, ausc and rr, there are
strongly differing effects for the different events being predicted (e.g., Y = 0
or Y = 1 | Y ≥ 1). As is often the case, there is no one constant β model that
satisfies assumptions with respect to all predictors simultaneously, especially
when there is evidence for non-ordinality for ausc in Figure 14.2. The CR
model will need to be generalized to adequately fit this dataset.


14.8  Extended  Continuation  Ratio  Model 

The  CR  model  in  its  ordinary  form  has  no  advantage  over  the  PO  model  for 
this  dataset.  But  Section  13.4.6  discussed  how  the  CR  model  can  easily  be 
extended to relax any of its assumptions. First we use the cr.setup function to set up the data for fitting a CR model using the binary logistic trick.

u <- cr.setup(Y)
Sc.expanded <- Sc[u$subs, ]
y <- u$y
cohort <- u$cohort
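To make the expansion concrete, here is a hand-rolled base R sketch for a three-level response. It is not the rms implementation of cr.setup; the toy vector `Y`, the helper data frames, and the row ordering are assumptions made for illustration only. Each subject contributes one record to the 'all' cohort, and subjects with Y ≥ 1 contribute a second record to the 'Y>=1' cohort.

```r
# Toy ordinal response with levels 0, 1, 2 (hypothetical data)
Y  <- c(0, 1, 2, 2, 0, 1)
id <- seq_along(Y)

# Cohort 'all': every subject is at risk; the binary event is Y == 0
all_rows <- data.frame(subs   = id,
                       cohort = "all",
                       y      = as.integer(Y == 0))

# Cohort 'Y>=1': only subjects with Y >= 1 are at risk; the event is Y == 1
keep     <- Y >= 1
ge1_rows <- data.frame(subs   = id[keep],
                       cohort = "Y>=1",
                       y      = as.integer(Y[keep] == 1))

# Stack the risk cohorts; 'subs' indexes rows of the original data frame,
# so predictors can be expanded with original_data[expanded$subs, ]
expanded <- rbind(all_rows, ge1_rows)
```

The stacked records can then be analyzed with an ordinary binary logistic model in which cohort is a predictor, which is the "binary logistic trick" referred to above.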



Here the cohort variable has values 'all', 'Y>=1' corresponding to the conditioning events in Equation 13.10. Once the data frame is expanded to include the different risk cohorts, vectors such as age are lengthened (to 5553 records). Now we fit a fully extended CR model that makes no equal slopes assumptions; that is, the model has to fit Y assuming the covariables are linear and additive. At this point, we omit hrat but add back all variables that were deleted by examining their association with Y. Recall that most of these seven cluster scores were summarized using PC1. Adding back “insignificant”
variables  will  allow  us  to  validate  the  model  fairly  using  the  bootstrap,  as 
well  as  to  obtain  confidence  intervals  that  are  not  falsely  narrow.16 

full <-
  lrm(y ~ cohort * (ageg * (rcs(temp, 5) + rcs(rr, 5)) +
      rcs(waz, 4) + bul.conv + drowsy + agitated + reffort +
      ausc + feeding + abdominal + hydration + hxprob +
      pustular + crying + fever.ill + stop.breath + labor),
      data = Sc.expanded, x = TRUE, y = TRUE)
# x=TRUE, y=TRUE are for pentrace, validate, calibrate below
perf <- function(fit) {   # model performance for Y=0
  pr <- predict(fit, type = 'fitted')[cohort == 'all']
  s  <- round(somers2(pr, y[cohort == 'all']), 3)
  pr <- 1 - pr   # Predict Prob[Y > 0] instead of Prob[Y = 0]
  f  <- round(c(mean(pr < .05), mean(pr > .25),
                mean(pr > .5)), 2)
  f  <- paste(f[1], ', ', f[2], ', and ', f[3], '.', sep = '')
  list(somers = s, fractions = f)
}
perf.unpen <- perf(full)
print(full, latex = TRUE, coefs = 5)


Logistic Regression Model

lrm(formula = y ~ cohort * (ageg * (rcs(temp, 5) +
    rcs(rr, 5)) + rcs(waz, 4) + bul.conv + drowsy +
    agitated + reffort + ausc + feeding + abdominal +
    hydration + hxprob + pustular + crying + fever.ill +
    stop.breath + labor), data = Sc.expanded, x = TRUE,
    y = TRUE)

                            Model Likelihood      Discrimination     Rank Discrim.
                               Ratio Test            Indexes            Indexes
Obs                  5553   LR χ²      1824.33    R²       0.406     C      0.843
 0                   1512   d.f.            87    g        1.677     Dxy    0.685
 1                   4041   Pr(> χ²)  < 0.0001    gr       5.350     γ      0.687
max |∂ log L/∂β|   8×10⁻⁷                         gp       0.269     τa     0.272
                                                  Brier    0.135

                   Coef      S.E.     Wald Z   Pr(>|Z|)
Intercept          1.3966    9.0827    0.15    0.8778
cohort=Y>=1        1.5077   14.6443    0.10    0.9180
ageg=[ 7,60)      -9.3715   11.4104   -0.82    0.4115
ageg=[60,90]     -26.4502   17.2188   -1.54    0.1245
temp              -0.0049    0.2551   -0.02    0.9846

Table 14.6 Wald statistics for cohort in the CR model

                                          χ²      d.f.   P
cohort (Factor+Higher Order Factors)     199.47    44    < 0.0001
  All Interactions                       172.12    43    < 0.0001
TOTAL                                    199.47    44    < 0.0001

latex(anova(full, cohort), file = '',   # Table 14.6
      caption = 'Wald statistics for \\co{cohort} in the CR model',
      size = 'smaller[2]', label = 'ordinal-anova.cohort')

an <- anova(full, india = FALSE, indnl = FALSE)

latex(an, file = '', label = 'ordinal-anova.full',
      caption = 'Wald statistics for the continuation ratio model.
      Interactions with \\co{cohort} assess non-proportional
      hazards', caption.lot = 'Wald statistics for $Y$ in the
      continuation ratio model',
      size = 'smaller[2]')   # Table 14.7

This model has LR χ² = 1824 with 87 d.f. Wald statistics are in Tables 14.6 and 14.7. The global test of the constant slopes assumption in the CR model (test of all interactions involving cohort) has Wald χ² = 172 with 43 d.f., P < 0.0001. Consistent with Figure 14.5, the formal tests indicate that ausc is the biggest violator, followed by waz and rr.


14.9  Penalized  Estimation 

We know that the CR model must be extended to fit these data adequately. If the model is fully extended to allow for all cohort × predictor interactions, we have not gained any precision or power in using an ordinal model over using a polytomous logistic model. Therefore we seek some restrictions on the model’s parameters. The lrm and pentrace functions allow for differing λ for shrinking different types of terms in the model. Here we do a grid search to determine the optimum penalty for simple main effect (non-interaction) terms and the penalty for interaction terms, most of which are terms interacting with cohort to allow for unequal slopes. The following code uses pentrace on the full extended CR model fit to find the optimum penalty factors. All combinations of the simple and interaction λs for which the interaction penalty ≥ the penalty for the simple parameters are examined.

Table 14.7 Wald statistics for the continuation ratio model. Interactions with cohort assess non-proportional hazards

                                   χ²       d.f.   P
cohort                            199.47     44    < 0.0001
ageg                               48.89     36      0.0742
temp                               59.37     24      0.0001
rr                                 93.77     24    < 0.0001
waz                                39.69      6    < 0.0001
bul.conv                           10.80      2      0.0045
drowsy                             15.19      2      0.0005
agitated                           13.55      2      0.0011
reffort                            51.85      2    < 0.0001
ausc                              109.80      2    < 0.0001
feeding                            27.47      2    < 0.0001
abdominal                           1.78      2      0.4106
hydration                           4.47      2      0.1069
hxprob                              6.62      2      0.0364
pustular                            3.03      2      0.2194
crying                              1.55      2      0.4604
fever.ill                           3.63      2      0.1630
stop.breath                         5.34      2      0.0693
labor                               5.35      2      0.0690
ageg × temp                         8.18     16      0.9432
ageg × rr                          38.11     16      0.0015
cohort × ageg                      14.88     18      0.6701
cohort × temp                       8.77     12      0.7225
cohort × rr                        19.67     12      0.0736
cohort × waz                        9.04      3      0.0288
cohort × bul.conv                   0.33      1      0.5658
cohort × drowsy                     0.57      1      0.4489
cohort × agitated                   0.55      1      0.4593
cohort × reffort                    2.29      1      0.1298
cohort × ausc                      38.11      1    < 0.0001
cohort × feeding                    2.48      1      0.1152
cohort × abdominal                  0.09      1      0.7696
cohort × hydration                  0.53      1      0.4682
cohort × hxprob                     2.54      1      0.1109
cohort × pustular                   2.40      1      0.1210
cohort × crying                     0.39      1      0.5310
cohort × fever.ill                  3.17      1      0.0749
cohort × stop.breath                2.99      1      0.0839
cohort × labor                      0.05      1      0.8309
cohort × ageg × temp                2.22      8      0.9736
cohort × ageg × rr                 10.22      8      0.2500
TOTAL NONLINEAR                    93.36     40    < 0.0001
TOTAL INTERACTION                 203.10     59    < 0.0001
TOTAL NONLINEAR + INTERACTION     257.70     67    < 0.0001
TOTAL                            1211.73     87    < 0.0001

d <- options(digits = 4)
pentrace(full,
         list(simple = c(0, .025, .05, .075, .1),
              interaction = c(0, 10, 50, 100, 125, 150)))

Best penalty:

 simple interaction    df
   0.05         125 49.75

 simple interaction    df  aic  bic aic.c
  0.000           0 87.00 1650 1074  1648
  0.000          10 60.63 1671 1269  1669
  0.025          10 60.11 1672 1274  1670
  0.050          10 59.80 1672 1276  1670
  0.075          10 59.58 1671 1277  1670
  0.100          10 59.42 1671 1278  1670
  0.000          50 54.64 1671 1309  1670
  0.025          50 54.14 1672 1313  1671
  0.050          50 53.83 1672 1316  1671
  0.075          50 53.62 1672 1317  1671
  0.100          50 53.46 1672 1318  1671
  0.000         100 51.61 1672 1330  1671
  0.025         100 51.11 1673 1334  1672
  0.050         100 50.81 1673 1336  1672
  0.075         100 50.60 1672 1337  1671
  0.100         100 50.44 1672 1338  1671
  0.000         125 50.55 1672 1337  1671
  0.025         125 50.05 1673 1341  1672
  0.050         125 49.75 1673 1343  1672
  0.075         125 49.54 1672 1344  1672
  0.100         125 49.39 1672 1345  1671
  0.000         150 49.65 1672 1343  1671
  0.025         150 49.15 1672 1347  1672
  0.050         150 48.85 1673 1349  1672
  0.075         150 48.64 1672 1350  1671
  0.100         150 48.49 1672 1351  1671

options(d)

We see that shrinkage from 87 d.f. down to 49.75 effective d.f. results in an improvement in χ²-scaled AIC of 23. The optimum penalty factors were 0.05 for simple terms and 125 for interaction terms.
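As a check on the arithmetic, the χ²-scaled AIC can be reproduced from the LR χ² and (effective) d.f. printed for the unpenalized and penalized fits. The notation below is ours, introduced only for this worked example; it assumes the χ²-scale definition of AIC as LR χ² minus twice the effective d.f.

```latex
\begin{align*}
\mathrm{AIC}_{\chi^2} &= \mathrm{LR}\ \chi^2 - 2 \times \text{(effective d.f.)} \\
\text{unpenalized:}\quad & 1824.33 - 2(87) = 1650.33 \approx 1650 \\
\text{penalized:}\quad   & 1772.11 - 2(49.75) = 1672.61 \approx 1673 \\
\text{improvement:}\quad & 1673 - 1650 = 23 \quad \text{(from the rounded table entries)}
\end{align*}
```

The penalized fit gives up about 52 units of LR χ² but saves about 37 effective d.f., so the penalty-adjusted fit improves.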

Let us now store a penalized version of the full fit, find where the effective d.f. were reduced, and compute χ² for each factor in the model. We take the effective d.f. for a collection of model parameters to be the sum of the diagonals of the matrix product defined underneath Gray’s Equation 2.9 [237] that correspond to those parameters.

full.pen <-
  update(full,
         penalty = list(simple = .05, interaction = 125))
print(full.pen, latex = TRUE, coefs = FALSE)

Logistic Regression Model

lrm(formula = y ~ cohort * (ageg * (rcs(temp, 5) + rcs(rr, 5)) +
    rcs(waz, 4) + bul.conv + drowsy + agitated + reffort + ausc +
    feeding + abdominal + hydration + hxprob + pustular + crying +
    fever.ill + stop.breath + labor), data = Sc.expanded, x = TRUE,
    y = TRUE, penalty = list(simple = 0.05, interaction = 125))

Penalty factors

 simple nonlinear interaction nonlinear.interaction
   0.05      0.05         125                   125

                            Model Likelihood      Discrimination     Rank Discrim.
                               Ratio Test            Indexes            Indexes
Obs                  5553   LR χ²      1772.11    R²       0.392     C      0.840
 0                   1512   d.f.         49.75    g        1.594     Dxy    0.679
 1                   4041   Pr(> χ²)  < 0.0001    gr       4.924     γ      0.681
max |∂ log L/∂β|   1×10⁻⁷   Penalty      21.48    gp       0.263     τa     0.269
                                                  Brier    0.136

effective.df(full.pen)

Original and Effective Degrees of Freedom

                           Original   Penalized
All                              87       49.75
Simple Terms                     20       19.98
Interaction or Nonlinear         67       29.77
Nonlinear                        40       16.82
Interaction                      59       22.57
Nonlinear Interaction            32        9.62

## Compute discrimination for Y=0 vs. Y>0
perf.pen <- perf(full.pen)   # Figure 14.6
# Exclude interactions and cohort effects from plot
plot(anova(full.pen), cex.labels = 0.75, rm.ia = TRUE,
     rm.other = 'cohort  (Factor+Higher Order Factors)')
[Figure 14.6 dot chart; predictor labels in the order shown: ageg, fever.ill, crying, pustular, abdominal, hydration, stop.breath, labor, hxprob, bul.conv, agitated, drowsy, temp, feeding, waz, rr, reffort, ausc]

Fig. 14.6 Importance of predictors in full penalized model, as judged by partial Wald χ² minus the predictor d.f. The Wald χ² values for each line in the dot plot include contributions from all higher-order effects. Interaction effects by themselves have been removed as has the cohort effect.


This will be the final model except for the model used in Section 14.10. The model has LR χ² = 1772. The output of effective.df shows that non-interaction terms have barely been penalized, and coefficients of interaction terms have been shrunken from 59 d.f. to effectively 22.6 d.f. Predictive discrimination was assessed by computing the Somers’ Dxy rank correlation between Xβ̂ and whether Y = 0, in the subset of records for which Y = 0 is what was being predicted. Here Dxy = 0.672, and the ROC area is 0.838 (the unpenalized model had an apparent Dxy = 0.676). To summarize in another way the effectiveness of this model in screening infants for risks of any abnormality, the fraction of infants with predicted probabilities that Y > 0 being < 0.05, > 0.25, and > 0.5 are, respectively, 0.1, 0.28, and 0.14. anova output is plotted in Figure 14.6 to give a snapshot of the importance of the various predictors. The Wald statistics used here are computed on a variance-covariance matrix which is adjusted for penalization (using Gray Equation 2.6 [237] before it was determined that the sandwich covariance estimator performs less well than the inverse of the penalized information matrix; see p. 211).

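The full equation for the fitted model is below. Only the part of the equation used for predicting Pr(Y = 0) is shown, other than an intercept for Y ≥ 1 that does not apply when Y = 0.

latex(full.pen, which = 1:21, file = '')

\begin{align*}
X\hat{\beta} ={}& -1.337435\,[Y \geq 1] \\
& +0.1074525\,[\mathrm{ageg} \in [7,60)] + 0.1971287\,[\mathrm{ageg} \in [60,90]] \\
& +0.1978706\,\mathrm{temp} + 0.1091831(\mathrm{temp} - 36.19998)_+^3 - 2.833442(\mathrm{temp} - 37)_+^3 \\
& +5.07114(\mathrm{temp} - 37.29999)_+^3 - 2.507527(\mathrm{temp} - 37.69998)_+^3 \\
& +0.1606456(\mathrm{temp} - 39)_+^3 \\
& +0.02090741\,\mathrm{rr} - 6.336873 \times 10^{-5}(\mathrm{rr} - 32)_+^3 + 8.405441 \times 10^{-5}(\mathrm{rr} - 42)_+^3 \\
& +6.152416 \times 10^{-5}(\mathrm{rr} - 49)_+^3 - 0.0001018105(\mathrm{rr} - 59)_+^3 + 1.960063 \times 10^{-5}(\mathrm{rr} - 76)_+^3 \\
& -0.07589699\,\mathrm{waz} + 0.02508918(\mathrm{waz} + 2.9)_+^3 - 0.1185068(\mathrm{waz} + 0.75)_+^3 \\
& +0.1225752(\mathrm{waz} - 0.28)_+^3 - 0.02915754(\mathrm{waz} - 1.73)_+^3 - 0.4418073\,\mathrm{bul.conv} \\
& -0.08185088\,\mathrm{drowsy} - 0.05327209\,\mathrm{agitated} - 0.2304409\,\mathrm{reffort} \\
& -1.158604\,\mathrm{ausc} - 0.1599588\,\mathrm{feeding} - 0.1608684\,\mathrm{abdominal} \\
& -0.05409718\,\mathrm{hydration} + 0.08086387\,\mathrm{hxprob} + 0.007519746\,\mathrm{pustular} \\
& +0.04712091\,\mathrm{crying} + 0.004298725\,\mathrm{fever.ill} - 0.3519033\,\mathrm{stop.breath} \\
& +0.06863879\,\mathrm{labor} \\
& +[\mathrm{ageg} \in [7,60)]\,[6.499592 \times 10^{-5}\,\mathrm{temp} - 0.00279976(\mathrm{temp} - 36.19998)_+^3 \\
& \quad -0.008691166(\mathrm{temp} - 37)_+^3 - 0.004987871(\mathrm{temp} - 37.29999)_+^3 \\
& \quad +0.0259236(\mathrm{temp} - 37.69998)_+^3 - 0.009444801(\mathrm{temp} - 39)_+^3] \\
& +[\mathrm{ageg} \in [60,90]]\,[0.0001320368\,\mathrm{temp} - 0.00182639(\mathrm{temp} - 36.19998)_+^3 \\
& \quad -0.01640406(\mathrm{temp} - 37)_+^3 - 0.0476041(\mathrm{temp} - 37.29999)_+^3 \\
& \quad +0.09142148(\mathrm{temp} - 37.69998)_+^3 - 0.02558693(\mathrm{temp} - 39)_+^3] \\
& +[\mathrm{ageg} \in [7,60)]\,[-0.0009437598\,\mathrm{rr} - 1.044673 \times 10^{-6}(\mathrm{rr} - 32)_+^3 \\
& \quad -1.670499 \times 10^{-6}(\mathrm{rr} - 42)_+^3 - 5.189082 \times 10^{-6}(\mathrm{rr} - 49)_+^3 + 1.428634 \times 10^{-5}(\mathrm{rr} - 59)_+^3 \\
& \quad -6.382087 \times 10^{-6}(\mathrm{rr} - 76)_+^3] \\
& +[\mathrm{ageg} \in [60,90]]\,[-0.001920811\,\mathrm{rr} - 5.52134 \times 10^{-6}(\mathrm{rr} - 32)_+^3 \\
& \quad -8.628392 \times 10^{-6}(\mathrm{rr} - 42)_+^3 - 4.147347 \times 10^{-6}(\mathrm{rr} - 49)_+^3 + 3.813427 \times 10^{-5}(\mathrm{rr} - 59)_+^3 \\
& \quad -1.98372 \times 10^{-5}(\mathrm{rr} - 76)_+^3]
\end{align*}

where [c] = 1 if subject is in group c, 0 otherwise; (x)_+ = x if x > 0, 0 otherwise.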

Now consider displays of the shapes of effects of the predictors. For the continuous variables temp and rr that interact with age group, we show the effects for all three age groups separately for each Y cutoff. All effects have been centered so that the log odds at the median predictor value is zero when cohort='all', so these plots actually show log odds relative to reference values. The patterns in Figures 14.9 and 14.8 are in agreement with those in Figure 14.5.

yl <- c(-3, 1)   # put all plots on common y-axis scale
# Plot predictors that interact with another predictor
# Vary ageg over all age groups, then vary temp over its
# default range (10th smallest to 10th largest values in
# data).  Make a separate plot for each 'cohort'
# ref.zero centers effects using median x
dd <- datadist(Sc.expanded); dd <- datadist(dd, cohort)
options(datadist = 'dd')
p1 <- Predict(full.pen, temp, ageg, cohort,
              ref.zero = TRUE, conf.int = FALSE)
p2 <- Predict(full.pen, rr, ageg, cohort,
              ref.zero = TRUE, conf.int = FALSE)
p <- rbind(temp = p1, rr = p2)   # Figure 14.7
source(paste('http://biostat.mc.vanderbilt.edu/wiki/pub/Main',
             'RConfiguration/graphicsSet.r', sep = '/'))
ggplot(p, ~ cohort, groups = 'ageg', varypred = TRUE,
       ylim = yl, layout = c(2, 1), legend.position = c(.85, .8),
       addlayer = ltheme(width = 3, height = 3, text = 2.5, title = 2.5),
       adj.subtitle = FALSE)   # ltheme defined with source()

TRUE)
yl <- c(-1.5, 1.5)
ggplot(yeq1, ylim = yl, sepdiscrete = 'vertical')   # Figure 14.8
dd$limits['Adjust to', 'cohort'] <- 'all'   # original default
all <- Predict(full.pen, name = v, ref.zero = TRUE)
ggplot(all, ylim = yl, sepdiscrete = 'vertical')   # Figure 14.9


14.10  Using  Approximations  to  Simplify  the  Model 

Parsimonious models can be developed by approximating predictions from the model to any desired level of accuracy. Let L̂ = Xβ̂ denote the predicted log odds from the full penalized ordinal model, including multiple records for subjects with Y > 0. Then we can use a variety of techniques to approximate L̂ from a subset of the predictors (in their raw form). With this approach one can immediately see what is lost over the full model by computing, for example, the mean absolute error in predicting L̂. Another advantage to full model approximation is that shrinkage used in computing L̂ is inherited by any model that predicts L̂. In contrast, the usual stepwise methods result in β̂ that are too large since the final coefficients are estimated as if the model structure were prespecified.

CART would be particularly useful as a model approximator as it would result in a prediction tree that would be easy for health workers to use. Unfortunately, a 50-node CART was required to predict L̂ with an R² ≥ 0.9, and the mean absolute error in the predicted logit was still 0.4. This will happen when the model contains many important continuous variables.

[Figure 14.7 panels: Temperature; Adjusted respiratory rate]

Fig. 14.7 Centered effects of predictors on the log odds, showing the effects of two predictors with interaction effects for the age intervals noted. The title all refers to the prediction of Y = 0 | Y ≥ 0, that is, Y = 0. Y>=1 refers to predicting the probability of Y = 1 | Y ≥ 1.

Fig. 14.8 Centered effects of predictors on the log odds, for predicting Y = 1 | Y ≥ 1

Let’s approximate the full model using its important components, by using a step-down technique predicting L̂ from all of the component variables using ordinary least squares. In using step-down with the least squares function ols in rms there is a problem when the initial R² = 1.0 as in that case the estimate of σ = 0. This can be circumvented by specifying an arbitrary nonzero value of σ to ols (here 1.0), as we are not using the variance-covariance matrix from ols anyway. Since cohort interacts with the predictors, separate approximations can be developed for each level of Y. For this example we approximate the log odds that Y = 0 using the cohort of patients used for determining Y = 0, that is, Y ≥ 0 or cohort='all'.

Fig. 14.9 Centered effects of predictors on the log odds, for predicting Y ≥ 1. No plot was made for the fever.ill, stop.breath, or labor cluster scores.


plogit <- predict(full.pen)
f <- ols(plogit ~ ageg * (rcs(temp, 5) + rcs(rr, 5)) +
         rcs(waz, 4) + bul.conv + drowsy + agitated +
         reffort + ausc + feeding + abdominal + hydration +
         hxprob + pustular + crying + fever.ill +
         stop.breath + labor,
         subset = cohort == 'all', data = Sc.expanded, sigma = 1)
# Do fast backward stepdown
w <- options(width = 120)
fastbw(f, aics = 1e10)

Deleted

Approximate Estimates after Deleting Factors

      Coef    S.E.     Wald Z  P
[1,]  1.617   0.01482  109.1   0

Factors in Final Model

None

options(w)
# 1e10 causes all variables to eventually be
# deleted so can see most important ones in order

# Fit an approximation to the full penalized model using
# most important variables
full.approx <-
  ols(plogit ~ rcs(temp, 5) + ageg * rcs(rr, 5) +
      rcs(waz, 4) + bul.conv + drowsy + reffort +
      ausc + feeding,
      subset = cohort == 'all', data = Sc.expanded)
p <- predict(full.approx)
abserr <- mean(abs(p - plogit[cohort == 'all']))
Dxy <- somers2(p, y[cohort == 'all'])['Dxy']

The approximate model had R² against the full penalized model of 0.972, and the mean absolute error in predicting L̂ was 0.17. The Dxy rank correlation between the approximate model’s predicted logit and the binary event Y = 0 is 0.665 as compared with the full model’s Dxy = 0.672. See Section 19.5 for an example of computing correct estimates of variance of the parameters in an approximate model.
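The mechanics of model approximation can be illustrated with base R alone. The sketch below is not the case-study analysis: the predictors, the coefficients, and the "full model" linear predictor `lp_full` are simulated assumptions, and `lm` stands in for the rms `ols` function. It shows only how R² against the full model's predicted logit and the mean absolute error are obtained once a reduced set of predictors is chosen.

```r
# Simulated predictors and a stand-in for the full model's linear predictor
# (hypothetical values; x3 plays the role of a weak predictor to be dropped)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
lp_full <- 0.8 * x1 - 0.5 * x2 + 0.05 * x3

# Approximate the full model's predicted logit from a subset of predictors
# using ordinary least squares
approx_fit <- lm(lp_full ~ x1 + x2)
lp_approx  <- fitted(approx_fit)

# What is lost relative to the full model
r2     <- summary(approx_fit)$r.squared     # R^2 against the full-model logit
abserr <- mean(abs(lp_approx - lp_full))    # mean absolute error in the logit
```

Because the response here is the full model's prediction rather than the observed outcome, any shrinkage built into `lp_full` carries over to the approximation.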

Next turn to diagramming this model approximation so that all predicted values can be computed without the use of a computer. We draw a type of nomogram that converts each effect in the model to a 0 to 100 scale which is just proportional to the log odds. These points are added across predictors to derive the “Total Points,” which are converted to L̂ and then to predicted probabilities. For the interaction between rr and ageg, rms’s nomogram function automatically constructs three rr axes; only one is added into the total point score for a given subject. Here we draw a nomogram for predicting the probability that Y > 0, which is 1 − Pr(Y = 0). This probability is derived by negating β̂ and Xβ̂ in the model derived to predict Pr(Y = 0).

The nomogram is shown in Figure 14.10. As an example in using the nomogram, a six-day-old infant gets approximately 9 points for having a respiration rate of 30/minute, 19 points for having a temperature of 39°C, 11 points for waz=0, 14 points for drowsy=5, and 15 points for reffort=2. Assuming that bul.conv=ausc=feeding=0, that infant gets 68 total points. This corresponds to Xβ̂ = −0.68 and a probability of 0.34.
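The last step of reading the nomogram, converting a linear predictor to a probability, is just the inverse logit. A minimal base R check follows; the variable names are ours, and the value −0.81 is the full-model linear predictor quoted for the same infant in Section 14.13.

```r
# Linear predictors (log odds that Y > 0) for the example infant
lp_approx <- -0.68   # from the approximate model (nomogram total of 68 points)
lp_full   <- -0.81   # from the full model, as quoted in Further Reading

# Inverse logit: Pr = 1 / (1 + exp(-lp))
p_approx <- plogis(lp_approx)
p_full   <- plogis(lp_full)

round(c(approx = p_approx, full = p_full), 2)   # 0.34 and 0.31
```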


14.11  Validating  the  Model 

For the full CR model that was fitted using penalized maximum likelihood estimation (PMLE), we used 200 bootstrap replications to estimate and then to correct for optimism in various statistical indexes: Dxy, generalized R², intercept and slope of a linear re-calibration equation for Xβ̂, the maximum calibration error for Pr(Y = 0) based on the linear-logistic re-calibration (Emax), and the Brier quadratic probability score B. PMLE is used at each of the 200 resamples. During the bootstrap simulations, we sample with replacement from the patients and not from the 5553 expanded records, hence the specification cluster=u$subs, where u$subs is the vector of sequential patient numbers computed from cr.setup above. To be able to assess predictive accuracy of a single predicted probability, the subset parameter is specified so that Pr(Y = 0) is being assessed even though 5553 observations are used to develop each of the 200 models.

Fig. 14.10 Nomogram for predicting Pr(Y > 0) from the penalized extended CR model, using an approximate model fitted using ordinary least squares (R² = 0.972 against the full model’s predicted logits).

set.seed(1)   # so can reproduce results
v <- validate(full.pen, B = 200, cluster = u$subs,
              subset = cohort == 'all')
latex(v, file = '', digits = 2, size = 'smaller')

Index       Original   Training   Test     Optimism   Corrected   n
            Sample     Sample     Sample              Index
Dxy          0.67       0.68       0.67      0.01       0.66      200
R²           0.38       0.38       0.37      0.01       0.36      200
Intercept   -0.03      -0.03       0.00     -0.03       0.00      200
Slope        1.03       1.03       1.00      0.03       1.00      200
Emax         0.00       0.00       0.00      0.00       0.00      200
D            0.28       0.29       0.28      0.01       0.27      200
U            0.00       0.00       0.00      0.00       0.00      200
Q            0.28       0.29       0.28      0.01       0.27      200
B            0.12       0.12       0.12      0.00       0.12      200
g            1.47       1.50       1.45      0.04       1.42      200
gp           0.22       0.23       0.22      0.00       0.22      200

v <- round(v, 3)

We see that the apparent Dxy = 0.672 and that the optimism from overfitting was estimated to be 0.011 for the PMLE model, so the bias-corrected estimate of predictive discrimination is 0.661. The intercept and slope needed to re-calibrate Xβ̂ to a 45° line are very near (0, 1). The estimate of the maximum calibration error in predicting Pr(Y = 0) is 0.001, which is quite satisfactory. The corrected Brier score is 0.122.
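The bias correction follows the usual optimism bootstrap, written out below for Dxy. The superscripted symbols are our notation for this worked example: the training value is the index of the model fitted to bootstrap sample b evaluated on that same sample, and the test value is the index of that model evaluated on the original sample.

```latex
\begin{align*}
\text{optimism} &= \frac{1}{B} \sum_{b=1}^{B}
   \left( D_{xy}^{\text{train},(b)} - D_{xy}^{\text{test},(b)} \right),
   \qquad B = 200 \\
D_{xy}^{\text{corrected}} &= D_{xy}^{\text{apparent}} - \text{optimism}
   = 0.672 - 0.011 = 0.661
\end{align*}
```

The same subtraction gives the corrected values in the last numeric column of the validation table for each of the other indexes.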

The simple calibration statistics just listed do not address the issue of whether predicted values from the model are miscalibrated in a nonlinear way, so now we estimate an overfitting-corrected calibration curve nonparametrically.

cal <- calibrate(full.pen, B = 200, cluster = u$subs,
                 subset = cohort == 'all')
err <- plot(cal)   # Figure 14.11

n = 5553   Mean absolute error = 0.017   Mean squared error = 0.00043
0.9 Quantile of absolute error = 0.038


The results are shown in Figure 14.11. One can see a slightly nonlinear calibration function estimate, but the overfitting-corrected calibration is excellent everywhere, being only slightly worse than the apparent calibration. The estimated maximum calibration error is 0.044. The excellent validation for both predictive discrimination and calibration is a result of the large sample size, frequency distribution of Y, initial data reduction, and PMLE.


14.12  Summary 

Clinically guided variable clustering and item weighting resulted in a great reduction in the number of candidate predictor degrees of freedom and hence increased the true predictive accuracy of the model. Scores summarizing clusters of clinical signs, along with temperature, respiration rate, and weight-for-age after suitable nonlinear transformation and allowance for interactions with age, are powerful predictors of the ordinal response. Graphical methods are effective for detecting lack of fit in the PO and CR models and for diagramming the final model. Model approximation allowed development of parsimonious clinical prediction tools. Approximate models inherit the shrinkage from the full model. For the ordinal model developed here, substantial shrinkage of the full model was needed.

Fig. 14.11 Bootstrap calibration curve for the full penalized extended CR model. 200 bootstrap repetitions were used in conjunction with the loess smoother [111]. Also shown is a “rug plot” to demonstrate how effective this model is in discriminating patients into low- and high-risk groups for Pr(Y = 0) (which corresponds with the derived variable value y = 1 when cohort='all').


14.13  Further  Reading 


1  See Moons et al. [462] for another case study in penalized maximum likelihood estimation.

2  The lasso method of Tibshirani [608, 609] also incorporates shrinkage into variable selection.

3  To see how this compares with predictions using the full model, the extra clinical signs in that model that are not in the approximate model were predicted individually on the basis of Xβ̂ from the reduced model along with the signs that are in that model, using ordinary linear regression. The signs not specified when evaluating the approximate model were then set to predicted values based on the values given for the 6-day-old infant above. The resulting Xβ̂ for the full model is −0.81 and the predicted probability is 0.31, as compared with −0.68 and 0.34 quoted above.


14.14  Problems 

Develop  a  proportional  odds  ordinal  logistic  model  predicting  the  severity 
of  functional  disability  (sfdm2)  in  SUPPORT.  The  highest  level  of  this  vari¬ 
able  corresponds  to  patients  dying  before  the  two-month  follow-up  interviews. 
Consider  this  level  as  the  most  severe  outcome.  Consider  the  following  pre¬ 
dictors:  age,  sex,  dzgroup,  num.co,  scoma,  race  (use  all  levels),  meanbp,  hrt, 
temp,  pafi,  alb,  adlsc.  The  last  variable  is  the  baseline  level  of  functional 
disability  from  the  “activities  of  daily  living  scale.” 

1.  For  the  variables  adlsc,  sex,  age,  meanbp,  and  others  if  you  like,  make 
plots  of  means  of  predictors  stratified  by  levels  of  the  response,  to  check 
for  ordinality.  On  the  same  plot,  show  estimates  of  means  assuming  the  pro¬ 
portional  odds  relationship  between  predictors  and  response  holds.  Com¬ 
ment  on  the  evidence  for  ordinality  and  for  proportional  odds. 

2. To allow for maximum adjustment of baseline functional status, treat
this predictor as nominal (after rounding it to the nearest whole number;
fractional values are the result of imputation) in remaining steps, so
that all dummy variables will be generated. Make a single chart showing
proportions of various outcomes stratified (individually) by adlsc, sex,
age, meanbp. For continuous predictors use quartiles. To obtain the
proportions of patients having sfdm2 at or worse than each of its possible
levels (other than the first level), an easy way is to use the cumcategory
function with the Hmisc package's summary (summary.formula) function.
Print estimates to only two significant digits of precision. Manually check
the calculations for the sex variable using table(sex, sfdm2). Then plot all
estimates on a single graph using plot(object, which=1:4), where object was
created by summary (actually summary.formula). Note: for printing tables
you may want to convert sfdm2 to a 0-4 variable so that column headers are
short and so that later calculations are simpler. You can use for example:

sfdm <- as.integer(sfdm2) - 1

3. Use an R function such as the following to compute the logits of the
cumulative proportions.

sf <- function(y)
  c('Y>=1'=qlogis(mean(y >= 1)),
    'Y>=2'=qlogis(mean(y >= 2)),
    'Y>=3'=qlogis(mean(y >= 3)),
    'Y>=4'=qlogis(mean(y >= 4)))

As the Y = 3 category is rare, it may be even better to omit the Y ≥ 4
column above, as was done in Section 13.3.9 and Figure 13.1. For each
predictor pick two rows of the summary table having reasonable sample
sizes, and take the difference between the two rows. Comment on the
validity of the proportional odds assumption by assessing how constant
the row differences are across columns. Note: constant differences in log
odds (logits) mean constant ratios of odds or constant relative effects of
the predictor across outcome levels.

4. Make two plots nonparametrically relating age to all of the cumulative
proportions or their logits. You can use commands such as the following
(to use the R Hmisc package).

for(i in 1:4)
  plsmo(age, sfdm >= i, add=i > 1,
        ylim=c(.2, .8), ylab='Proportion Y >= j')
for(i in 1:4)
  plsmo(age, sfdm >= i, add=i > 1, fun=qlogis,
        ylim=qlogis(c(.2, .8)), ylab='logit')

Comment  on  the  linearity  of  the  age  effect  (which  of  the  two  plots  do 
you  use?)  and  on  the  proportional  odds  assumption  for  age,  by  assessing 
parallelism  in  the  second  plot. 

5.  Impute  race  using  the  most  frequent  category  and  pafi  and  alb  using 
“normal”  values. 

6.  Fit  a  model  to  predict  the  ordinal  response  using  all  predictors.  For  con¬ 
tinuous  ones  assume  a  smooth  relationship  but  allow  it  to  be  nonlinear. 
Quantify  the  ability  of  the  model  to  discriminate  patients  in  the  five  out¬ 
comes.  Do  an  overall  likelihood  ratio  test  for  whether  any  variables  are 
associated  with  the  level  of  functional  disability. 

7.  Compute  partial  tests  of  association  for  each  predictor  and  a  test  of  nonlin¬ 
earity  for  continuous  ones.  Compute  a  global  test  of  nonlinearity.  Graphi¬ 
cally  display  the  ranking  of  importance  of  the  predictors. 

8.  Display  the  shape  of  how  each  predictor  relates  to  the  log  odds  of  exceeding 
any  level  of  sfdm2  you  choose,  setting  other  predictors  to  typical  values 
(one  value  per  predictor).  By  default,  Predict  will  make  predictions  for 
the  second  response  category,  which  is  a  satisfactory  choice  here. 

9. Use resampling to validate the Somers' $D_{xy}$ rank correlation between
predicted logit and the ordinal outcome. Also validate the generalized $R^2$,
and slope shrinkage coefficient, all using a single R command. Comment
on the quality (potential "export-ability") of the model.


Chapter  15 

Regression  Models  for  Continuous  Y 
and  Case  Study  in  Ordinal  Regression 


This  chapter  concerns  univariate  continuous  Y.  There  are  many  multivariable 
models  for  predicting  such  response  variables,  such  as 

•  linear  models  with  assumed  normal  residuals,  fitted  with  ordinary  least 
squares 

•  generalized  linear  models  and  other  parametric  models  based  on  special 
distributions  such  as  the  gamma 

•  generalized  additive  models  (GAMs)277

•  generalization  of  GAMs  to  also  nonparametrically  transform  Y  (see 
Chapter  16) 

•  quantile  regression  (see  Section  15.2) 

•  other  robust  regression  models  that,  like  quantile  regression,  use  an  objec¬ 
tive  different  from  minimizing  the  sum  of  squared  errors635 

•  semiparametric models based on the ranks of Y, such as the Cox
proportional hazards model (Chapter 20) and the proportional odds ordinal
logistic model (Chapters 13 and 14)

•  cumulative probability models (often called cumulative link models) which
are semiparametric models from a wider class of families than the logistic.

Semiparametric  models  that  treat  Y  as  ordinal  but  not  interval-scaled  have 
many  advantages  including  robustness  and  freedom  from  all  distributional 
assumptions  for  Y  conditional  on  any  given  set  of  predictors.  Advantages 
are  demonstrated  in  a  case  study  of  a  cumulative  probability  ordinal  model. 
Some  of  the  results  are  compared  to  quantile  regression  and  OLS.  Many  of 
the  methods  used  in  the  case  study  also  apply  to  ordinary  linear  models. 


15.1  The  Linear  Model 

The  most  popular  multivariable  model  for  analyzing  a  univariate  continuous 
Y  is  the  linear  model 
$$E(Y|X) = X\beta, \tag{15.1}$$

where $\beta$ is estimated using ordinary least squares, that is, by solving for $\hat{\beta}$ to
minimize $\sum (Y_i - X\hat{\beta})^2$.

To compute P-values and confidence limits using parametric methods we
would have to assume that $Y|X$ is normal with mean $X\beta$ and constant
variance $\sigma^2$ (footnote a). One could estimate conditional means of Y without any
distributional assumptions, but least squares estimators are not robust to outliers or
high-leverage points, and the model would be inaccurate in estimating
conditional quantiles of $Y|X$ or $\mathrm{Prob}[Y \ge c|X]$ unless normality of residuals holds.
To be accurate in estimating all quantities, the linear model assumes that
the Gaussian distribution of $Y|X_1$ is a simple shift from the distribution of
$Y|X_2$.
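
As a minimal sketch of the least squares criterion in (15.1), the following base R code minimizes the sum of squared errors by solving the normal equations and compares the result with R's lm function. The simulated data, the coefficient values, and all object names are illustrative assumptions and do not come from the text.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)            # simulated linear model (assumed values)

X <- cbind(Intercept = 1, x = x)     # design matrix with an intercept column

# Least squares estimate: solve the normal equations X'X b = X'y
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))

# The same estimates from the built-in linear model fitter
beta_lm <- coef(lm(y ~ x))

# Sum of squared errors as a function of a candidate coefficient vector
sse <- function(b) sum((y - X %*% b)^2)

print(rbind(normal_equations = beta_ne, lm = beta_lm))
print(c(sse_at_estimate = sse(beta_ne), sse_nearby = sse(beta_ne + c(0.05, 0))))
```

Moving the coefficients away from the least squares solution can only increase the sum of squared errors, which is what the second printed line illustrates.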


15.2  Quantile  Regression 

Quantile regression 55,35 is a different approach to modeling Y. It makes no
distributional assumptions other than continuity of Y, while having all the
usual right hand side assumptions. Quantile regression provides essentially
the same estimates as sample quantiles if there is only an intercept or a
categorical predictor in the model. Quantile regression is transformation invariant:
pre-transforming Y is not important.

Quantile regression is a natural generalization of sample quantiles. Let
$\rho_\tau(y) = y(\tau - [y < 0])$. The $\tau$th sample quantile is the minimizer $q$ of
$\sum_{i=1}^{n} \rho_\tau(y_i - q)$. For a conditional $\tau$th quantile of $Y|X$ the corresponding
quantile regression estimator $\hat{\beta}_\tau$ minimizes $\sum_{i=1}^{n} \rho_\tau(Y_i - X\beta)$.

In  non-large  samples,  quantile  regression  is  not  as  efficient  at  estimating 
quantiles  as  is  ordinary  least  squares  at  estimating  the  mean,  if  the  latter’s 
assumptions  hold. 

Koenker’s  quantreg  package  in  R  implements  quantile  regression,  and 
the  rms  package’s  Rq  function  provides  a  front-end  that  gives  rise  to  various 
graphics  and  inference  tools. 

Using quantile regression, we directly model the median as a function
of covariates so that only the $X\beta$ structure need be correct. Other quantiles
(e.g., 99th percentile) can be modeled but standard errors will be much larger
as it is more difficult to precisely estimate outer quantiles.
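
The link between the check loss and sample quantiles can be illustrated for the intercept-only case with a few lines of base R. This is only a sketch on simulated data (the data and object names are assumptions); it does not use the quantreg package or the Rq function, and it handles no covariates.

```r
set.seed(2)
y   <- rexp(101)                      # skewed simulated response, odd n
tau <- 0.5                            # quantile of interest (the median)

# Check loss rho_tau(u) = u * (tau - [u < 0])
rho <- function(u, tau) u * (tau - (u < 0))

# Total loss as a function of a candidate quantile q
loss <- function(q) sum(rho(y - q, tau))

# Search over the observed values; the minimizer is a sample quantile
q_hat <- y[which.min(sapply(y, loss))]
q_emp <- unname(quantile(y, probs = tau, type = 1))

# Sample quantiles commute with increasing transformations of Y
q_log <- unname(quantile(log(y), probs = tau, type = 1))

print(c(minimizer = q_hat, sample_quantile = q_emp,
        exp_of_log_quantile = exp(q_log)))
```

The last element shows the transformation property: the median of log Y, back-transformed, equals the median of Y.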

a  The  latter  assumption  may  be  dispensed  with  if  we  use  a  robust  Huber-White  or 
bootstrap  covariance  matrix  estimate.  Normality  may  sometimes  be  dispensed  with 
by  using  bootstrap  confidence  intervals. 


15.3  Ordinal  Regression  Models  for  Continuous  Y 
A different robust semiparametric regression approach than quantile
regression is the cumulative probability ordinal model. Semiparametric models
have several advantages over parametric models such as OLS. While quantile
regression has no restriction in the parameters when modeling one quantile
versus another (footnote b), ordinal cumulative probability models assume a connection
between distributions of Y for different X. Ordinal regression even makes
one less assumption than quantile regression about the distribution of Y for
a specific X: the distribution need not be continuous.

Applying an increasing 1-1 transformation to Y results in no change to
regression coefficient estimates with ordinal regression (footnote c). Regression coefficient
estimates are completely robust to extreme Y values (footnote d). Estimates of quantiles
of Y from ordinal regression are exactly transformation-preserving, e.g., the
estimate of the median of log Y is exactly the log of the estimate of the
median Y.

For a general continuous distribution function $F(y)$, an ordinal regression
model based on cumulative probabilities may be stated as follows (footnote e). Let the
ordered unique values of Y be denoted by $y_1, y_2, \ldots, y_k$ and let the intercepts
associated with $y_1, \ldots, y_k$ be $\alpha_1, \alpha_2, \ldots, \alpha_k$, where $\alpha_1 = \infty$ because
$\mathrm{Prob}[Y \ge y_1] = 1$. Let $\alpha_y = \alpha_i, i : y_i = y$. Then

$$\mathrm{Prob}[Y \ge y_i|X] = F(\alpha_i + X\beta) = F(\alpha_{y_i} + X\beta) \tag{15.2}$$

For the OLS fully parametric case, the model may be restated

$$\mathrm{Prob}[Y \ge y|X] = \mathrm{Prob}\left[\frac{Y - X\beta}{\sigma} \ge \frac{y - X\beta}{\sigma}\right] \tag{15.3}$$

$$= 1 - \Phi\left(\frac{y - X\beta}{\sigma}\right) = \Phi\left(\frac{-y + X\beta}{\sigma}\right) \tag{15.4}$$


b  Quantile  regression  allows  the  estimated  value  of  the  0.5  quantile  to  be  higher  than 
the  estimated  value  of  the  0.6  quantile  for  some  values  of  X.  Composite  quantile 
regression690  removes  this  possibility  by  forcing  all  the  X  coefficients  to  be  the  same 
across  multiple  quantiles,  a  restriction  not  unlike  what  cumulative  probability  ordinal 
models  make. 

c  For  symmetric  distributions  applying  a  decreasing  transformation  will  negate  the 
coefficients.  For  asymmetric  distributions  (e.g.,  Gumbel),  reversing  the  order  of  Y 
will  do  more  than  change  signs. 

d  Only an estimate of mean Y from these $\hat{\beta}$s is non-robust.

e  It is more traditional to state the model in terms of $\mathrm{Prob}[Y \le y|X]$ but we use
$\mathrm{Prob}[Y \ge y|X]$ so that higher predicted values are associated with higher Y.
Table 15.1  Distribution families used in ordinal cumulative probability models. $\Phi$
denotes the Gaussian cumulative distribution function. For the Connection column,
$P_1 = \mathrm{Prob}[Y \ge y|X_1]$, $P_2 = \mathrm{Prob}[Y \ge y|X_2]$, $\Delta = (X_2 - X_1)\beta$. The connection
specifies the only distributional assumption if the model is fitted semiparametrically,
i.e., contains an intercept for every unique Y value less one. For parametric models, $P_1$
must be specified absolutely instead of just requiring a relationship between $P_1$ and
$P_2$. For example, the traditional Gaussian parametric model specifies that
$\mathrm{Prob}[Y \ge y|X] = 1 - \Phi\left(\frac{y - X\beta}{\sigma}\right) = \Phi\left(\frac{-y + X\beta}{\sigma}\right)$.

Distribution            F                                   Inverse (Link Function)     Link Name                Connection
Logistic                $[1 + \exp(-y)]^{-1}$               $\log(\frac{y}{1-y})$       logit                    $\frac{P_2}{1-P_2} = \frac{P_1}{1-P_1} \exp(\Delta)$
Gaussian                $\Phi(y)$                           $\Phi^{-1}(y)$              probit                   $P_2 = \Phi(\Phi^{-1}(P_1) + \Delta)$
Gumbel maximum value    $\exp(-\exp(-y))$                   $-\log(-\log(y))$           log-log                  $P_2 = P_1^{\exp(-\Delta)}$
Gumbel minimum value    $1 - \exp(-\exp(y))$                $\log(-\log(1-y))$          complementary log-log    $1 - P_2 = (1 - P_1)^{\exp(\Delta)}$
Cauchy                  $\frac{1}{\pi}\tan^{-1}(y) + \frac{1}{2}$    $\tan[\pi(y - \frac{1}{2})]$    cauchit


so that to within an additive constant (footnote f) $\alpha_y = \frac{-y}{\sigma}$ (intercepts $\alpha$ are linear in
$y$ whereas they are arbitrarily descending in the ordinal model), and $\sigma$ is
absorbed in $\beta$ to put the OLS model into the new notation.

The general ordinal regression model assumes that for fixed $X_1, X_2$,

$$F^{-1}(\mathrm{Prob}[Y \ge y|X_2]) - F^{-1}(\mathrm{Prob}[Y \ge y|X_1]) \tag{15.5}$$

$$= (X_2 - X_1)\beta \tag{15.6}$$

independent of the $\alpha$s (parallelism assumption). If $F = [1 + \exp(-y)]^{-1}$, this
is the proportional odds assumption.

Common choices of F, implemented in the R rms orm function, are shown
in Table 15.1. The Gumbel maximum value distribution is also called the
extreme value type I distribution. This distribution (log-log link) also
represents a continuous time proportional hazards model. The hazard ratio when
X changes from $X_1$ to $X_2$ is $\exp(-(X_2 - X_1)\beta)$.
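
The Connection column of Table 15.1 can be checked numerically. The sketch below, in base R, computes $P_2 = F(F^{-1}(P_1) + \Delta)$ for each family and compares it with the closed-form connections. The values of P1 and Delta and all object names are arbitrary illustrative assumptions.

```r
P1    <- 0.30     # Prob[Y >= y | X1], arbitrary illustrative value
Delta <- 0.70     # (X2 - X1) beta, arbitrary illustrative value

# Each family: distribution function F and its inverse (the link)
fam <- list(
  logistic = list(F = plogis,  Finv = qlogis),
  probit   = list(F = pnorm,   Finv = qnorm),
  loglog   = list(F = function(y) exp(-exp(-y)),
                  Finv = function(p) -log(-log(p))),
  cloglog  = list(F = function(y) 1 - exp(-exp(y)),
                  Finv = function(p) log(-log(1 - p))),
  cauchit  = list(F = pcauchy, Finv = qcauchy)
)

# P2 implied by the parallelism assumption: F(Finv(P1) + Delta)
P2 <- sapply(fam, function(f) f$F(f$Finv(P1) + Delta))

# Closed-form connections for three of the families
odds   <- function(p) p / (1 - p)
o2     <- odds(P1) * exp(Delta)              # logistic: odds multiply by exp(Delta)
closed <- c(logistic = o2 / (1 + o2),
            loglog   = P1 ^ exp(-Delta),
            cloglog  = 1 - (1 - P1) ^ exp(Delta))

print(round(P2, 4))
print(round(closed, 4))
```

For the log-log family the exponent $\exp(-\Delta)$ is the hazard ratio mentioned above, so a positive $\Delta$ raises the probability of exceeding any $y$.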

The mean of $Y|X$ is easily estimated from a fitted cumulative probability
ordinal model by computing

$$\sum_{i=1}^{k} y_i \widehat{\mathrm{Prob}}[Y = y_i|X] \tag{15.7}$$

and the $q$th quantile of $Y|X$ is $y$ such that $F^{-1}(1 - q) - X\hat{\beta} = \hat{\alpha}_y$ (footnote g).


f  $\hat{\alpha}_y$ are unchanged if a constant is added to all $y$.

g  The intercepts have to be shifted to the left one position in solving this equation
because the quantile is such that $\mathrm{Prob}[Y \le y] = q$ whereas the model is stated in
terms of $\mathrm{Prob}[Y \ge y]$.
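
The mean and quantile calculations can be sketched in base R for a single covariate setting. Everything numeric here (the unique Y values, the intercepts, and the linear predictor) is made up for illustration, the logistic family is an arbitrary choice, and the quantile is a plain step-function version with no interpolation, so it is not claimed to match the algorithm used by the rms package.

```r
y     <- c(4.8, 5.2, 5.5, 5.8, 6.3)     # ordered unique Y values (made up)
alpha <- c(Inf, 1.5, 0.2, -1.0, -2.4)   # intercepts, alpha[1] = Inf (made up)
xb    <- 0.3                            # linear predictor X beta-hat (made up)

Fam <- plogis                           # logistic family chosen as an example

exceed <- Fam(alpha + xb)               # Prob[Y >= y_i | X]; first element is 1
cell   <- exceed - c(exceed[-1], 0)     # Prob[Y = y_i | X] by differencing
mean_y <- sum(y * cell)                 # estimated mean, as in (15.7)

# Prob[Y <= y_i | X] = 1 - Prob[Y >= y_{i+1} | X]: the one-position shift
cdf <- 1 - c(exceed[-1], 0)

# Step-function quantile: smallest y_i whose cumulative probability reaches q
quant <- function(q) y[which(cdf >= q)[1]]

print(c(mean = mean_y, median = quant(0.5), q90 = quant(0.9)))
```

Differencing adjacent exceedance probabilities is what turns the cumulative model into cell probabilities, and the shifted vector `cdf` is the reason the intercepts move one position when solving for a quantile.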
The  orm  function  in  the  rms  package  takes  advantage  of  the  information
matrix  being  of  a  sparse  tri-band  diagonal  form  for  the  intercept  parameters. 
This  makes  the  computations  efficient  even  for  hundreds  of  intercepts  (i.e., 
unique  values  of  Y).  orm  is  made  to  handle  continuous  Y. 

Ordinal  regression  has  nice  properties  in  addition  to  those  listed  above, 
allowing  for 

•  estimation of quantiles as efficiently as quantile regression if the parallel
slopes assumptions hold

•  efficient estimation of mean Y

•  direct estimation of $\mathrm{Prob}[Y \ge y|X]$

•  arbitrary clumping of values of Y, while still estimating $\beta$ and mean Y
efficiently (footnote h)

•  solutions for $\beta$ using ordinary Newton-Raphson or other popular
optimization techniques

•  being based on a standard likelihood function, penalized estimation can
be straightforward

•  Wald, score, and likelihood ratio $\chi^2$ tests that are more powerful than tests
from quantile regression.

On the last point, if there is a single predictor in the model and it is binary,
the score test from the proportional odds model is essentially the Wilcoxon
test, and the score test from the Gumbel log-log cumulative probability
model is essentially the log-rank test.


15.3.1  Minimum  Sample  Size  Requirement 

When Y is continuous and the purpose of an ordinal model includes
semiparametric estimation of probabilities or quantiles, the accuracy of estimates
is limited even more by the accuracy of estimating the empirical cumulative
distribution of Y than by estimating $\beta$. When $\beta = 0$, intercept estimates are
transformations of the empirical distribution step function. As described in
Section 20.3, the sample size must be 184 to estimate the entire distribution
of Y with a global margin of error not exceeding 0.1. For estimating the mean
of Y, smaller sample sizes may be needed.
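
One calculation that reproduces a sample size of this magnitude is based on the Dvoretzky-Kiefer-Wolfowitz bound for the maximum error of an empirical distribution function. The use of this bound and the 0.95 confidence level are assumptions made for this sketch; the derivation referred to in Section 20.3 is not reproduced here, and the function name n_dkw is hypothetical.

```r
# Sample size n such that sup_y |ECDF(y) - F(y)| <= eps with probability
# at least conf, from the bound 2 * exp(-2 * n * eps^2) <= 1 - conf
n_dkw <- function(eps, conf = 0.95) log(2 / (1 - conf)) / (2 * eps^2)

n_01  <- n_dkw(0.10)    # global margin of error 0.10
n_005 <- n_dkw(0.05)    # halving the margin quadruples the sample size

print(c(margin_0.10 = n_01, margin_0.05 = n_005))
```

Because the required n grows with the inverse square of the margin of error, tightening the margin from 0.1 to 0.05 multiplies the requirement by four.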


h  But  it  is  not  sensible  to  estimate  quantiles  of  Y  when  there  are  heavy  ties  in  Y  in 
the  area  containing  the  quantile. 
15.4  Comparison  of  Assumptions  of  Various  Models

Quantile regression makes the fewest left-hand-side model assumptions except
for the assumption that Y be continuous, but can have less estimator precision
than other models and has lower power. To summarize how assumptions of
parametric models compare to assumptions of semiparametric ordinal models,
consider the ordinary linear model or its special case the equal variance
two-sample t-test, vs. the probit or logit (proportional odds) ordinal model or
their special cases the Van der Waerden (normal-scores) two-sample rank test
or the Wilcoxon two-sample test. All the assumptions of the linear model
other than independence of residuals are captured in the following, using the
more standard $Y \le y$ notation:

$$F(y|X) = \mathrm{Prob}[Y \le y|X] = \Phi\left(\frac{y - X\beta}{\sigma}\right) \tag{15.8}$$

$$\Phi^{-1}(F(y|X)) = \frac{y - X\beta}{\sigma} \tag{15.9}$$

On  the  other  hand,  ordinal  models  assume  the  following: 


[Figure 15.1: two panels. Vertical axis labels: $\Phi^{-1}(F(y|X))$ (left panel); $\Phi^{-1}(F(y|X))$ and $\mathrm{logit}(F(y|X))$ (right panel).]

Fig. 15.1  Assumptions of the linear model (left panel) and semiparametric ordinal
probit or logit (proportional odds) models (right panel). Ordinal models do not
assume any shape for the distribution of Y for a given X; they only assume
parallelism. The linear model can relax the parallelism assumption if $\sigma$ is allowed to vary,
but in practice it is difficult to know how to vary it except for the unequal variance
two-sample t-test.


$$\mathrm{Prob}[Y \le y|X] = F(g(y) - X\beta), \tag{15.10}$$

where $g$ is unknown and may be discontinuous. This translates to the
parallelism assumption in the right panel of Figure 15.1, whereas the linear model
makes the additional strong assumption of linearity of normal inverse
cumulative distribution function, which arises from the Gaussian distribution
assumption.
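
The distinction between parallelism and linearity can be made concrete with a small base R sketch. It assumes, purely for illustration, that Y is lognormal with a location shift between two covariate settings, so that the unknown transformation in the ordinal probit model is the logarithm. The grid of y values, the value of Delta, and the object names are hypothetical.

```r
y     <- seq(0.5, 5, by = 0.5)    # grid of Y values (illustrative)
Delta <- 0.7                      # shift in X beta between two settings (assumed)

# Normal inverse of the exact distribution function of Y for each setting;
# here Y is lognormal, so this equals log(y) minus the location
z0 <- qnorm(plnorm(y, meanlog = 0))
z1 <- qnorm(plnorm(y, meanlog = Delta))

gap       <- z0 - z1                     # vertical distance between the curves
curvature <- diff(z0, differences = 2)   # second differences in y

print(round(gap, 6))         # constant: the curves are parallel
print(round(curvature, 3))   # nonzero: the curves are not straight lines in y
```

The two curves are parallel, so the ordinal probit assumption holds, yet neither is linear in y, so the Gaussian linear model on the original scale would not fit.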


15.5  Dataset  and  Descriptive  Statistics 

Diabetes Mellitus (DM) type II (adult onset diabetes) is strongly
associated with obesity. The currently best laboratory test for diabetes measures
glycosylated hemoglobin (HbA1c), also called glycated hemoglobin,
glycohemoglobin, or hemoglobin A1c. HbA1c reflects average blood glucose for the
preceding 60 to 90 days. HbA1c > 7.0 is sometimes taken as a positive
diagnosis of diabetes even though there are no data to support the use of a
threshold.

The goals of this analysis are to better understand effects of body size
measurements on risk of DM and to enhance screening for DM. The best way
to develop a model for DM screening is not to fit a binary logistic model
with HbA1c > 7 as the response variable. There are at least two reasons for
this. First, when the relationship between a measurement and its ultimate
clinical impact is smooth, all cutpoints are arbitrary. There is no justification
for any putative cut on HbA1c. Second, such an analysis loses information by
treating HbA1c=2 the same as HbA1c=6.9, and by treating HbA1c=7.1 as
equal to HbA1c=10. Failure to use all available information results in larger
standard errors of $\hat{\beta}$, lower power, and wider confidence bands. It is better to
predict continuous HbA1c using a continuous response model, then use that
model to estimate the probability that HbA1c exceeds any cutoff, or estimate
the 0.9 quantile of HbA1c.

The data used here are from the National Health and Nutrition Examination
Survey (NHANES) 2009-2010 from the U.S. National Center for Health
Statistics/Centers for Disease Control. The original data may be obtained
from http://www.cdc.gov/nchs/nhanes.htm [94]; the analysis file used here,
called nhgh, may be obtained from the DataSets wiki page, along with R code
used to download and create the file. Note that CDC coded age > 80 as 80.
We use the subset of subjects with age ≥ 21 who have neither been diagnosed
nor treated for DM. Descriptive statistics are shown below.

require(rms)
getHdata(nhgh)
w <- subset(nhgh, age >= 21 & dx==0 & tx==0, select=-c(dx,tx))
latex(describe(w), file='')
w

18 Variables    4629 Observations

seqn : Respondent sequence number
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0   4629    1   56902   52136   52633   54284   56930   59495   61079   61641

lowest : 51624 51629 51630 51645 51647
highest: 62152 62153 62155 62157 62158

sex
     n missing unique
  4629       0      2

male (2259, 49%), female (2370, 51%)

age : Age [years]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0    703    1   48.57   23.33   26.08   33.92   46.83   61.83   74.83   80.00

lowest : 21.00 21.08 21.17 21.25 21.33
highest: 79.67 79.75 79.83 79.92 80.00

re : Race/Ethnicity
     n missing unique
  4629       0      5

Mexican American (832, 18%), Other Hispanic (474, 10%)
Non-Hispanic White (2318, 50%), Non-Hispanic Black (756, 16%)
Other Race Including Multi-Racial (249, 5%)

income : Family Income
     n missing unique
  4389     240     14

[0,5000) (162, 4%), [5000,10000) (216, 5%), [10000,15000) (371, 8%)
[15000,20000) (300, 7%), [20000,25000) (374, 9%)
[25000,35000) (535, 12%), [35000,45000) (421, 10%)
[45000,55000) (346, 8%), [55000,65000) (257, 6%), [65000,75000) (188, 4%)
> 20000 (149, 3%), < 20000 (52, 1%), [75000,100000) (399, 9%)
>= 100000 (619, 14%)

wt : Weight [kg]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0    890    1   80.49   52.44   57.18   66.10   77.70   91.40  106.52  118.00

lowest : 33.2 36.1 37.9 38.5 38.7
highest: 184.3 186.9 195.3 196.6 203.0

ht : Standing Height [cm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0    512    1   167.5   151.1   154.4   160.1   167.2   175.0   181.0   184.8

lowest : 123.3 135.4 137.5 139.4 139.8
highest: 199.2 199.3 199.6 201.7 202.7

bmi : Body Mass Index [kg/m2]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0   1994    1   28.59   20.02   21.35   24.12   27.60   31.88   36.75   40.68

lowest : 13.18 14.59 15.02 15.40 15.49
highest: 61.20 62.81 65.62 71.30 84.87

leg : Upper Leg Length [cm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4474     155    216    1   38.39    32.0    33.5    36.0    38.4    41.0    43.3    44.6

lowest : 20.4 24.9 25.0 25.1 26.4, highest: 49.0 49.5 49.8 50.0 50.3

arml : Upper Arm Length [cm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4502     127    156    1   37.01    32.6    33.5    35.0    37.0    39.0    40.6    41.7

lowest : 24.8 27.0 27.5 29.2 29.5, highest: 45.2 45.5 45.6 46.0 47.0

armc : Arm Circumference [cm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4499     130    290    1   32.87    25.4    26.9    29.5    32.5    35.8    39.1    41.4

lowest : 17.9 19.0 19.3 19.5 19.9, highest: 54.2 54.9 55.3 56.0 61.0

waist : Waist Circumference [cm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4465     164    716    1   97.62    74.8    78.6    86.9    96.3   107.0   117.8   125.0

lowest : 59.7 60.0 61.5 62.0 62.4
highest: 160.0 160.6 162.2 162.7 168.7

tri : Triceps Skinfold [mm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4295     334    342    1   18.94     7.2     8.8    12.0    18.0    25.2    31.0    33.8

lowest : 2.6 3.1 3.2 3.3 3.4, highest: 39.6 39.8 40.0 40.2 40.6

sub : Subscapular Skinfold [mm]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  3974     655    329    1    20.8    8.60   10.30   14.40   20.30   26.58   32.00   35.00

lowest : 3.8 4.2 4.6 4.8 4.9, highest: 40.0 40.1 40.2 40.3 40.4

gh : Glycohemoglobin [%]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4629       0     63 0.99   5.533     4.8     5.0     5.2     5.5     5.8     6.0     6.3

lowest : 4.0 4.1 4.2 4.3 4.4, highest: 11.9 12.0 12.1 12.3 14.5

albumin : Albumin [g/dL]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4576      53     26 0.99   4.261     3.7     3.9     4.1     4.3     4.5     4.7     4.8

lowest : 2.6 2.7 3.0 3.1 3.2, highest: 4.9 5.0 5.1 5.2 5.3

bun : Blood urea nitrogen [mg/dL]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4576      53     50 0.99   13.03       7       8      10      12      15      19      22

lowest : 1 2 3 4 5, highest: 49 53 55 56 63

SCr : Creatinine [mg/dL]
     n missing unique Info    Mean     .05     .10     .25     .50     .75     .90     .95
  4576      53    167    1  0.8887    0.58    0.62    0.72    0.84    0.99    1.14    1.25

lowest : 0.34 0.38 0.39 0.40 0.41
highest: 5.98 6.34 9.13 10.98 15.66

dd <- datadist(w); options(datadist='dd')


15.5.1  Checking  Assumptions  of  OLS 
and  Other  Models 

First let's see if gh would make a Gaussian residuals model fit. Use ordinary
regression on four key variables to collapse these into one variable (predicted
mean from the OLS model). Stratify the predicted means into six quantile
groups. Apply the normal inverse cumulative distribution function to the
empirical cumulative distribution functions (ECDF) of gh using these strata,
and check for normality and constant $\sigma^2$. The ECDF estimates $\mathrm{Prob}[Y \le y|X]$
but for ordinal modeling we want to state models in terms of $\mathrm{Prob}[Y \ge y|X]$
so take one minus the ECDF before inverse transforming.


f <- ols(gh ~ rcs(age,5) + sex + re + rcs(bmi,3), data=w)
pgh <- fitted(f)

p <- function(fun, row, col) {
  f <- substitute(fun); g <- function(F) eval(f)
  z <- Ecdf(~ gh, groups=cut2(pgh, g=6),
            fun=function(F) g(1 - F),
            ylab=as.expression(f), xlim=c(4.5, 7.75), data=w,
            label.curve=FALSE)
  print(z, split=c(col, row, 2, 2), more=row < 2 | col < 2)
}
p(log(F/(1-F)),   1, 1)
p(qnorm(F),       1, 2)
p(-log(-log(F)),  2, 1)
p(log(-log(1-F)), 2, 2)
# Get slopes of pgh for some cutoffs of Y
# Use glm complementary log-log link on Prob(Y < cutoff) to
# get log-log link on Prob(Y >= cutoff)
r <- NULL
for(link in c('logit', 'probit', 'cloglog'))
  for(k in c(5, 5.5, 6)) {
    co <- coef(glm(gh < k ~ pgh, data=w, family=binomial(link)))
    r <- rbind(r, data.frame(link=link, cutoff=k,
                             slope=round(co[2], 2)))
  }
print(r, row.names=FALSE)

    link cutoff slope
   logit    5.0 -3.39
   logit    5.5 -4.33
   logit    6.0 -5.62
  probit    5.0 -1.69
  probit    5.5 -2.61
  probit    6.0 -3.07
 cloglog    5.0 -3.18
 cloglog    5.5 -2.97
 cloglog    6.0 -2.51

[Figure 15.2: four panels; horizontal axis in each panel: Glycohemoglobin, %]

Fig. 15.2  Examination of normality and constant variance assumption, and
assumptions for various ordinal models
The upper right curves in Figure 15.2 are not linear, implying that a normal
conditional distribution cannot work for gh (footnote i). There is non-parallelism for the
logit model. The other graphs will be used to guide selection of an ordinal
model below.


15.6  Ordinal  Regression  Applied  to  HbA1c

In the upper left panel of Figure 15.2, logit inverse curves are not parallel
so the proportional odds assumption does not hold when predicting HbA1c.
The log-log link yields the highest degree of parallelism and most constant
regression coefficients across cutoffs of gh, so we use this link in an ordinal
regression model (linearity of the curves is not required).


15.6.1  Checking  Fit  for  Various  Models  Using  Age 


Another  way  to  examine  model  fit  is  to  flexibly  fit  the  single  most  important 
predictor  (age)  using  a  variety  of  methods,  and  compare  predictions  to  sample 
quantiles  and  means  based  on  subsets  on  age.  We  use  overlapping  subsets 
to  gain  resolution,  with  each  subset  composed  of  those  subjects  having  age 
within  five  years  of  the  point  being  predicted  by  the  models.  Here  we  predict 
the  0.5,  0.75,  and  0.9  quantiles  and  the  mean.  For  quantiles  we  can  compare 
to  quantile  regression  (discussed  below)  and  for  means  we  compare  to  OLS. 

ag  <- 25:75
lag <- length(ag)
q2  <- q3 <- p90 <- means <- numeric(lag)
for(i in 1:lag) {
  s <- which(abs(w$age - ag[i]) < 5)
  y <- w$gh[s]
  a <- quantile(y, probs=c(.5, .75, .9))
  q2[i]    <- a[1]
  q3[i]    <- a[2]
  p90[i]   <- a[3]
  means[i] <- mean(y)
}
fams <- c('logistic', 'probit', 'loglog', 'cloglog')
fe   <- function(pred, target) mean(abs(pred$yhat - target))
mod  <- gh ~ rcs(age,6)
P    <- Er <- list()
for(est in c('q2', 'q3', 'p90', 'mean')) {
  meth <- if(est == 'mean') 'ols' else 'QR'
  p    <- list()
  er   <- rep(NA, 5)
  names(er) <- c(fams, meth)
  for(family in fams) {
    h   <- orm(mod, family=family, data=w)
    fun <- if(est == 'mean') Mean(h)
    else {
      qu <- Quantile(h)
      switch(est, q2  = function(x) qu(.5,  x),
                  q3  = function(x) qu(.75, x),
                  p90 = function(x) qu(.9,  x))
    }
    p[[family]] <- z <- Predict(h, age=ag, fun=fun, conf.int=FALSE)
    er[family]  <- fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))
  }
  h <- switch(est,
              mean = ols(mod, data=w),
              q2   = Rq (mod, data=w),
              q3   = Rq (mod, tau=0.75, data=w),
              p90  = Rq (mod, tau=0.90, data=w))
  p[[meth]] <- z <- Predict(h, age=ag, conf.int=FALSE)
  er[meth]  <- fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))

  Er[[est]] <- er
  pr <- do.call('rbind', p)
  pr$est <- est
  P <- rbind.data.frame(P, pr)
}

xyplot(yhat ~ age | est, groups=.set., data=P, type='l',  # Figure 15.3
       auto.key=list(x=.75, y=.2, points=FALSE, lines=TRUE),
       panel=function(..., subscripts) {
         panel.xyplot(..., subscripts=subscripts)
         est <- P$est[subscripts[1]]
         lpoints(ag, switch(est, mean=means, q2=q2, q3=q3, p90=p90),
                 col=gray(.7))
         er <- format(round(Er[[est]],3), nsmall=3)
         ltext(26, 6.15, paste(names(er), collapse='\n'),
               cex=.7, adj=0)
         ltext(40, 6.15, paste(er, collapse='\n'),
               cex=.7, adj=1)})

1 They are not parallel either.

It  can  be  seen  in  Figure  15.3  that  models  dedicated  to  a  specific  task 
(quantile  regression  for  quantiles  and  OLS  for  means)  were  best  for  those 
tasks.  Although  the  log-log  ordinal  cumulative  probability  model  did  not 
estimate  the  median  as  accurately  as  some  other  methods,  it  does  well  for 
the  0.75  and  0.9  quantiles  and  is  the  best  compromise  overall  because  of 
its  ability  to  also  directly  predict  the  mean  as  well  as  quantities  such  as 
Prob[HbA1c > 7 | X].

From here on we focus on the log-log ordinal model. Returning to the
bottom left of Figure 15.2, let's look at quantile groups of predicted HbA1c
by OLS and plot predicted distributions of actual HbA1c against empirical
distributions.

w$pghg <- cut2(pgh, g=6)
f  <- orm(gh ~ pghg, data=w)
lp <- predict(f, newdata=data.frame(pghg=levels(w$pghg)))
ep <- ExProb(f)       # Exceedance prob. functn. generator in rms
z  <- ep(lp)
j  <- order(w$pghg)   # puts in order of lp (levels of pghg)
plot(z, xlim=c(4, 7.5), data=w[j, c('pghg', 'gh')])  # Fig. 15.4

Agreement between predicted and observed exceedance probability distributions
is excellent in Figure 15.4.

[Figure 15.3: x-axis age, 30 to 70]

Fig. 15.3 Three estimated quantiles and estimated mean using 6 methods, compared
against caliper-matched sample quantiles/means (circles). Numbers are mean absolute
differences between predicted and sample quantities using overlapping intervals
of age and caliper matching. QR: quantile regression.


To  return  to  the  initial  look  at  a  linear  model  with  assumed  Gaussian 
residuals,  fit  a  probit  ordinal  model  and  compare  the  estimated  intercepts  to 
the  linear  relationship  with  gh  that  is  assumed  by  the  normal  distribution. 


f  <- orm(gh ~ rcs(age,6), family=probit, data=w)
g  <- ols(gh ~ rcs(age,6), data=w)
s  <- g$stats['Sigma']
yu <- f$yunique[-1]
r  <- quantile(w$gh, c(.005, .995))
alphas <- coef(f)[1:num.intercepts(f)]
plot(-yu / s, alphas, type='l', xlim=rev(- r / s),   # Fig. 15.5
     xlab=expression(-y/hat(sigma)), ylab=expression(alpha[y]))

Figure 15.5 depicts a significant departure from the linear form implied by
Gaussian residuals (Eq. 15.4).

[Figure 15.4: x-axis gh, 4.0 to 7.5]

Fig. 15.4 Observed (dashed lines, open circles) and predicted (solid lines, closed
circles) exceedance probability distributions from a model using 6-tiles of OLS-predicted
HbA1c. Key shows quantile group intervals of predicted mean HbA1c.

Fig. 15.5 Estimated intercepts from probit model. Linearity would have indicated
Gaussian residuals.

15.6.2 Examination of BMI

Body mass index (BMI, weight divided by height²) is commonly used as an
obesity measure because it is well correlated with abdominal visceral fat.
But it is not obvious that BMI is the correct summary of height and weight
for predicting pre-clinical diabetes, and it may be the case that body size
measures other than height and weight are better predictors.

Use  the  log-log  ordinal  model  to  check  the  adequacy  of  BMI,  adjusting  for 
age  (without  assuming  linearity).  This  can  be  done  by  examining  the  ratio 
of  coefficients  of  log  height  and  log  weight,  and  also  by  using  AIC  to  judge 
whether  BMI  is  an  adequate  summary  of  height  and  weight  when  compared 
to  nonlinear  functions  of  the  logs,  and  to  a  tensor  spline  interaction  surface. 

f <- orm(gh ~ rcs(age,5) + log(ht) + log(wt),
         family=loglog, data=w)
print(f, latex=TRUE)


-log-log Ordinal Regression Model

orm(formula = gh ~ rcs(age, 5) + log(ht) + log(wt), data = w,
    family = loglog)

                            Model Likelihood       Discrimination   Rank Discrim.
                               Ratio Test             Indexes          Indexes
Obs                 4629    LR χ²       1126.94    R²   0.217       ρ                     0.486
Unique Y              63    d.f.              6    g    0.627       |Pr(Y ≥ Y0.5) - 1/2|  0.153
Y0.5                 5.5    Pr(> χ²)   < 0.0001    gr   1.872
max |∂ log L/∂β|  1×10⁻⁶    Score χ²    1262.81
                            Pr(> χ²)   < 0.0001

          Coef      S.E.    Wald Z   Pr(> |Z|)
age       0.0398    0.0055    7.29   < 0.0001
age'     -0.0158    0.0275   -0.57     0.5657
age''    -0.0072    0.0866   -0.08     0.9333
age'''    0.0309    0.1135    0.27     0.7853
ht       -3.0680    0.2789  -11.00   < 0.0001
wt        1.2748    0.0704   18.10   < 0.0001

aic <- NULL
for(mod in list(gh ~ rcs(age,5) + rcs(log(bmi),5),
                gh ~ rcs(age,5) + rcs(log(ht),5) + rcs(log(wt),5),
                gh ~ rcs(age,5) + rcs(log(ht),4) * rcs(log(wt),4)))
  aic <- c(aic, AIC(orm(mod, family=loglog, data=w)))
print(aic)


[1] 25910.77 25910.17 25906.03

The ratio of the coefficient of log height to the coefficient of log weight is
-2.4, which is between what BMI uses (-2) and the more dimensionally reasonable
weight / height³ (-3). By AIC, a spline interaction surface between height and
weight does slightly better than BMI in predicting HbA1c, but a nonlinear
function of BMI is barely worse. It will require other body size measures to
displace BMI as a predictor.
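The ratio can be read as an implied exponent: a linear predictor of the form b_wt log(wt) + b_ht log(ht) equals b_wt log(wt × ht^(b_ht/b_wt)). The following minimal sketch redoes that arithmetic from the printed coefficients and compares the three AIC values. The variable names are illustrative, and the numbers are copied by hand from the output above rather than extracted from a fitted object.

```r
# Coefficients of log(ht) and log(wt), copied from the printed orm fit
b_ht <- -3.0680
b_wt <-  1.2748

# Implied exponent of height relative to weight (BMI corresponds to -2)
ratio <- b_ht / b_wt
round(ratio, 1)

# AIC values copied from the printed output, in the order the models were fit
aic <- c(bmi = 25910.77, additive = 25910.17, surface = 25906.03)

# Differences from the smallest AIC; smaller is better
aic - min(aic)
```

The differences between the three AIC values are only a few units, which is consistent with the statement that the alternatives to BMI offer little improvement.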

As an aside, compare this model fit to that from the Cox proportional
hazards model. The Cox model uses a conditioning argument to obtain
a partial likelihood free of the intercepts α (and requires a second step to
estimate these log discrete hazard components) whereas we are using a full
marginal likelihood of the ranks of Y [330].

print(cph(Surv(gh) ~ rcs(age,5) + log(ht) + log(wt), data=w),
      latex=TRUE)


Cox Proportional Hazards Model

cph(formula = Surv(gh) ~ rcs(age, 5) + log(ht)
    + log(wt), data = w)

                    Model Tests         Discrimination
                                           Indexes
Obs        4629    LR χ²     1120.20    R²    0.215
Events     4629    d.f.            6    Dxy   0.359
Center   8.3792    Pr(> χ²)   0.0000    g     0.622
                   Score χ²  1258.07    gr    1.863
                   Pr(> χ²)   0.0000

          Coef      S.E.    Wald Z   Pr(> |Z|)
age      -0.0392    0.0054   -7.24   < 0.0001
age'      0.0148    0.0274    0.54     0.5888
age''     0.0093    0.0862    0.11     0.9144
age'''   -0.0321    0.1131   -0.28     0.7767
ht        3.0477    0.2779   10.97   < 0.0001
wt       -1.2653    0.0701  -18.04   < 0.0001

Close  agreement  of  the  two  is  seen,  as  expected. 


15.6.3  Consideration  of  All  Body  Size  Measurements 

Next  we  examine  all  body  size  measures,  and  check  their  redundancies. 

v <- varclus(~ wt + ht + bmi + leg + arml + armc + waist +
             tri + sub + age + sex + re, data=w)
plot(v)
# Omit wt so it won't be removed before bmi
redun(~ ht + bmi + leg + arml + armc + waist + tri + sub,
      data=w, r2=.75)

Redundancy Analysis

redun(formula = ~ht + bmi + leg + arml + armc + waist + tri +
    sub, data = w, r2 = 0.75)

n: 3853   p: 8   nk: 3

Number of NAs: 776

Frequencies of Missing Values Due to Each Variable
   ht   bmi   leg  arml  armc waist   tri   sub
    0     0   155   127   130   164   334   655

Transformation of target variables forced to be linear

R² cutoff: 0.75   Type: ordinary

R² with which each variable can be predicted from all other variables:

   ht   bmi   leg  arml  armc waist   tri   sub
0.829 0.924 0.682 0.748 0.843 0.864 0.531 0.594

Rendundant variables:

bmi ht

Predicted from variables:

leg arml armc waist tri sub

  Variable Deleted     R²   R² after later deletions
1              bmi  0.924                      0.909
2               ht  0.792


Six size measures adequately capture the entire set. Height and BMI are
removed (Figure 15.6). An advantage of removing height is that it is
age-dependent due to vertebral compression in the elderly:

f   <- orm(ht ~ rcs(age,4) * sex, data=w)   # Prop. odds model
qu  <- Quantile(f); med <- function(x) qu(.5, x)
ggplot(Predict(f, age, sex, fun=med, conf.int=FALSE),
       ylab='Predicted Median Height, cm')


However,  upper  leg  length  has  the  same  declining  trend,  implying  a  survival 
bias  or  birth  year  effect. 

In preparing to create a multivariable model, degrees of freedom are allocated
according to the generalized Spearman ρ² (Figure 15.7).ʲ


s <- spearman2(gh ~ age + sex + re + wt + leg + arml + armc +
               waist + tri + sub, data=w, p=2)
plot(s)


Parameters will be allocated in descending order of ρ². But note that
subscapular skinfold has a large number of NAs and other predictors also have
NAs. Suboptimal casewise deletion will be used until the final model is fitted
(Figure 15.8).


j Competition between collinear size measures hurts interpretation of partial tests of
association in a saturated additive model.

Fig. 15.6 Variable clustering for all potential predictors

[Figure 15.7: x-axis Age, years; separate curves for sex (male, female)]

Fig. 15.7 Estimated median height as a smooth function of age, allowing age to
interact with sex, from a proportional odds model


Because there are many competing body measures, we use backwards step-down
to arrive at a set of predictors. The bootstrap will be used to penalize
predictive ability for variable selection. First the full model is fit using
casewise deletion, then we do a composite test to assess whether any of the
frequently-missing predictors is important.

[Figure 15.8: dot chart of adjusted Spearman ρ² (x-axis 0.00 to 0.20), response gh]

           N   df
age     4629    2
waist   4465    2
leg     4474    2
sub     3974    2
armc    4499    2
wt      4629    2
re      4629    4
tri     4295    2
arml    4502    2
sex     4629    1

Fig. 15.8 Generalized squared rank correlations

f <- orm(gh ~ rcs(age,5) + sex + re + rcs(wt,3) + rcs(leg,3) + arml +
         rcs(armc,3) + rcs(waist,4) + tri + rcs(sub,3),
         family='loglog', data=w, x=TRUE, y=TRUE)
print(f, latex=TRUE, coefs=FALSE)


-log-log Ordinal Regression Model

orm(formula = gh ~ rcs(age, 5) + sex + re + rcs(wt, 3)
    + rcs(leg, 3) + arml + rcs(armc, 3) + rcs(waist, 4)
    + tri + rcs(sub, 3), data = w, x = TRUE, y = TRUE,
    family = "loglog")

Frequencies of Missing Values Due to Each Variable
[dot chart not recoverable]

                            Model Likelihood       Discrimination   Rank Discrim.
                               Ratio Test             Indexes          Indexes
Obs                 3853    LR χ²       1180.13    R²   0.265       ρ                     0.520
Unique Y              60    d.f.             22    g    0.732       |Pr(Y ≥ Y0.5) - 1/2|  0.172
Y0.5                 5.5    Pr(> χ²)   < 0.0001    gr   2.080
max |∂ log L/∂β|  3×10⁻⁵    Score χ²    1298.88
                            Pr(> χ²)   < 0.0001

## Composite test:
lan <- function(a) latex(a, table.env=FALSE, file='')
lan(anova(f, leg, arml, armc, waist, tri, sub))

                      χ²   d.f.          P
leg                 8.30      2     0.0158
  Nonlinear         3.32      1     0.0685
arml                0.16      1     0.6924
armc                6.66      2     0.0358
  Nonlinear         3.29      1     0.0695
waist              29.40      3   < 0.0001
  Nonlinear         4.29      2     0.1171
tri                16.62      1   < 0.0001
sub                40.75      2   < 0.0001
  Nonlinear         4.50      1     0.0340
TOTAL NONLINEAR    14.95      5     0.0106
TOTAL             128.29     11   < 0.0001

The model achieves Spearman ρ = 0.52, the rank correlation between
predicted and observed HbA1c.

We show the predicted mean and median HbA1c as a function of age,
adjusting other variables to their median or mode (Figure 15.9). Compare the
estimate of the median and 90th percentile with that from quantile regression.


M      <- Mean(f)
qu     <- Quantile(f)
med    <- function(x) qu(.5, x)
p90    <- function(x) qu(.9, x)
fq     <- Rq(formula(f), data=w)
fq90   <- Rq(formula(f), data=w, tau=.9)
pmean  <- Predict(f,    age, fun=M,   conf.int=FALSE)
pmed   <- Predict(f,    age, fun=med, conf.int=FALSE)
p90    <- Predict(f,    age, fun=p90, conf.int=FALSE)
pmedqr <- Predict(fq,   age, conf.int=FALSE)
p90qr  <- Predict(fq90, age, conf.int=FALSE)
z <- rbind('orm mean'=pmean, 'orm median'=pmed, 'orm P90'=p90,
           'QR median'=pmedqr, 'QR P90'=p90qr)
ggplot(z, groups='.set.',
       adj.subtitle=FALSE, legend.label=FALSE)

[Figure 15.9: x-axis Age, years; curves: orm mean, orm median, orm P90, QR median, QR P90]

Fig. 15.9 Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model
using casewise deletion, along with predictions of 0.5 and 0.9 quantiles from quantile
regression (QR). Age is varied and other predictors are held constant to medians/modes.

print(fastbw(f, rule='p'), estimates=FALSE)


 Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC
 arml     0.16    1     0.6924  0.16      1     0.6924  -1.84
 sex      0.45    1     0.5019  0.61      2     0.7381  -3.39
 wt       5.72    2     0.0572  6.33      4     0.1759  -1.67
 armc     3.32    2     0.1897  9.65      6     0.1400  -2.35

Factors in Final Model

[1] age   re    leg   waist tri   sub

set.seed(13)  # so can reproduce results
v <- validate(f, B=100, bw=TRUE, estimates=FALSE, rule='p')

Backwards Step-down - Original Model

 Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC
 arml     0.16    1     0.6924  0.16      1     0.6924  -1.84
 sex      0.45    1     0.5019  0.61      2     0.7381  -3.39
 wt       5.72    2     0.0572  6.33      4     0.1759  -1.67
 armc     3.32    2     0.1897  9.65      6     0.1400  -2.35

Factors in Final Model

[1] age   re    leg   waist tri   sub

# Show number of variables selected in first 30 boots
latex(v, B=30, file='', size='small')


Index                  Original  Training    Test    Optimism  Corrected    n
                        Sample    Sample    Sample               Index
ρ                       0.5225    0.5290    0.5208    0.0083     0.5142    100
R²                      0.2712    0.2788    0.2692    0.0095     0.2617    100
Slope                   1.0000    1.0000    0.9761    0.0239     0.9761    100
g                       1.2276    1.2505    1.2207    0.0298     1.1978    100
|Pr(Y ≥ Y0.5) - 1/2|    0.2007    0.2050    0.1987    0.0064     0.1943    100

Factors Retained in Backwards Elimination
First 30 Resamples

age sex re wt leg arml armc waist tri sub

Frequencies of Numbers of Factors Retained

 5  6  7  8  9 10
 1 19 29 46  4  1

Next we fit the reduced model, using multiple imputation to impute missing
predictors (Figure 15.10).


a <- aregImpute(~ gh + wt + ht + bmi + leg + arml + armc + waist +
                tri + sub + age + re, data=w, n.impute=5, pr=FALSE)
g <- fit.mult.impute(gh ~ rcs(age,5) + re + rcs(leg,3) +
                     rcs(waist,4) + tri + rcs(sub,4),
                     orm, a, family=loglog, data=w, pr=FALSE)
print(g, latex=TRUE, needspace='1.5in')

-log-log Ordinal Regression Model

fit.mult.impute(formula = gh ~ rcs(age, 5) + re + rcs(leg, 3)
    + rcs(waist, 4) + tri + rcs(sub, 4), fitter = orm,
    xtrans = a, data = w, pr = FALSE, family = loglog)

                            Model Likelihood       Discrimination   Rank Discrim.
                               Ratio Test             Indexes          Indexes
Obs                 4629    LR χ²       1448.42    R²   0.269       ρ                     0.513
Unique Y              63    d.f.             17    g    0.743       |Pr(Y ≥ Y0.5) - 1/2|  0.173
Y0.5                 5.5    Pr(> χ²)   < 0.0001    gr   2.102
max |∂ log L/∂β|  1×10⁻⁵    Score χ²    1569.21
                            Pr(> χ²)   < 0.0001

                                        Coef      S.E.    Wald Z   Pr(> |Z|)
age                                     0.0404    0.0055    7.29   < 0.0001
age'                                   -0.0228    0.0279   -0.82     0.4137
age''                                   0.0126    0.0876    0.14     0.8857
age'''                                  0.0424    0.1148    0.37     0.7116
re=Other Hispanic                      -0.0766    0.0597   -1.28     0.1992
re=Non-Hispanic White                  -0.4121    0.0449   -9.17   < 0.0001
re=Non-Hispanic Black                   0.0645    0.0566    1.14     0.2543
re=Other Race Including Multi-Racial   -0.0555    0.0750   -0.74     0.4593
leg                                    -0.0339    0.0091   -3.73     0.0002
leg'                                    0.0153    0.0105    1.46     0.1434
waist                                   0.0073    0.0050    1.47     0.1428
waist'                                  0.0304    0.0158    1.93     0.0536
waist''                                -0.0910    0.0508   -1.79     0.0732
tri                                    -0.0163    0.0026   -6.28   < 0.0001
sub                                    -0.0027    0.0097   -0.28     0.7817
sub'                                    0.0674    0.0289    2.33     0.0198
sub''                                  -0.1895    0.0922   -2.06     0.0398

an <- anova(g)
lan(an)


                      χ²   d.f.          P
age               692.50      4   < 0.0001
  Nonlinear        28.47      3   < 0.0001
re                168.91      4   < 0.0001
leg                24.37      2   < 0.0001
  Nonlinear         2.14      1     0.1434
waist             128.31      3   < 0.0001
  Nonlinear         4.05      2     0.1318
tri                39.44      1   < 0.0001
sub                39.30      3   < 0.0001
  Nonlinear         6.63      2     0.0363
TOTAL NONLINEAR    46.80      8   < 0.0001
TOTAL            1464.24     17   < 0.0001

b <- anova(g, leg, waist, tri, sub)
# Add new lines to the plot with combined effect of 4 size var.
s <- rbind(an, size=b['TOTAL', ])
class(s) <- 'anova.rms'
plot(s)


Fig.  15.10  ANOVA  for  reduced  model,  after  multiple  imputation,  with  addition  of 
a  combined  effect  for  four  size  variables 


ggplot(Predict(g), abbrev=TRUE, ylab=NULL)   # Figure 15.11


384  15  Regression  Models  for  Continuous  Y  and  Case  Study  in  Ordinal  Regression 


Compare the estimated age partial effects and confidence intervals with
those from a model using casewise deletion, and with bootstrap nonparametric
confidence intervals (also with casewise deletion).

Fig. 15.11 Partial effects (log hazard or log-log cumulative probability scale) of all
predictors in reduced model, after multiple imputation

Fig. 15.12 Partial effect for age from multiple imputation (center red line) and
casewise deletion (center blue line) with symmetric Wald 0.95 confidence bands using
casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise
deletion (blue lines), percentile bootstrap confidence bands using casewise deletion
(dashed blue lines), and symmetric Wald confidence bands accounting for multiple
imputation (red lines).


Figure 15.13 depicts the relationship between various predicted quantities,
demonstrating that the ordinal model makes fewer model assumptions that
dictate their connections. A Gaussian or log-Gaussian model would have a
straight-line relationship between the predicted mean and median.

lines(pmn, p90, col='blue')
abline(a=0, b=1, col=gray(.8))
text(6.5, 5.5, 'Median')
text(5.5, 6.3, '0.9', col='blue')
nint <- 350
scat1d(M(lp),   nint=nint)
scat1d(med(lp), side=2, nint=nint)
scat1d(q90(lp), side=4, col='blue', nint=nint)

Fig. 15.13 Predicted mean HbA1c vs. predicted median and 0.9 quantile along with
their marginal distributions


Finally,  let  us  draw  a  nomogram  that  shows  the  full  power  of  ordinal 
models,  by  predicting  five  quantities  of  interest. 

g      <- Newlevels(g, list(re=abbreviate(levels(w$re))))
exprob <- ExProb(g)
nom <-
  nomogram(g, fun=list(Mean=M,
                       'Median Glycohemoglobin' = med,
                       '0.9 Quantile'           = q90,
                       'Prob(HbA1c > 6.5)' =
                         function(x) exprob(x, y=6.5),
                       'Prob(HbA1c > 7.0)' =
                         function(x) exprob(x, y=7),
                       'Prob(HbA1c > 7.5)' =
                         function(x) exprob(x, y=7.5)),
           fun.at=list(seq(5, 8, by=.5),
                       c(5,5.25,5.5,5.75,6,6.25),
                       c(5.5,6,6.5,7,8,10,12,14),
                       c(.01,.05,.1,.2,.3,.4),
                       c(.01,.05,.1,.2,.3,.4),
                       c(.01,.05,.1,.2,.3,.4)))
plot(nom, lmgp=.28)   # Figure 15.14

Fig. 15.14 Nomogram for predicting median, mean, and 0.9 quantile of glycohemoglobin,
along with the estimated probability that HbA1c > 6.5, 7, or 7.5, all from
the log-log ordinal model


Chapter  16 

Transform-Both-Sides  Regression 


16.1  Background 


Fitting  multiple  regression  models  by  the  method  of  least  squares  is  one  of  the 
most  commonly  used  methods  in  statistics.  There  are  a  number  of  challenges 
to  the  use  of  least  squares,  even  when  it  is  only  used  for  estimation  and  not 
inference,  including  the  following. 


1.  How  should  continuous  predictors  be  transformed  so  as  to  get  a  good  fit? 

2.  Is  it  better  to  transform  the  response  variable?  How  does  one  find  a  good 
transformation  that  simplifies  the  right-hand  side  of  the  equation? 

3. What if Y needs to be transformed non-monotonically (e.g., |Y - 100|)
before it will have any correlation with X?


When one is trying to draw an inference about population effects using confidence
limits or hypothesis tests, the most common approach is to assume
that  the  residuals  have  a  normal  distribution.  This  is  equivalent  to  assuming 
that  the  conditional  distribution  of  the  response  Y  given  the  set  of  predictors 
X  is  normal  with  mean  depending  on  X  and  variance  that  is  (one  hopes) 
a  constant  independent  of  X.  The  need  for  a  distributional  assumption  to 
enable  us  to  draw  inferences  creates  a  number  of  other  challenges  such  as  the 
following. 


1.  If  for  the  untransformed  original  scale  of  the  response  Y  the  distribution  of 
the  residuals  is  not  normal  with  constant  spread,  ordinary  methods  will  not 
yield  correct  inferences  (e.g.,  confidence  intervals  will  not  have  the  desired 
coverage  probability  and  the  intervals  will  need  to  be  asymmetric). 

2. Quite often there is a transformation of Y that will yield well-behaving
residuals. How do you find this transformation? Can you find a transformation
for the Xs at the same time?

3. All classical statistical inferential methods assume that the full model was
pre-specified, that is, the model was not modified after examining the data.
How does one correct confidence limits, for example, for data-based model
and transformation selection?


16.2  Generalized  Additive  Models 

Hastie and Tibshirani have developed generalized additive models (GAMs)
for a variety of distributions for Y. There are semiparametric GAMs, but
most GAMs for continuous Y assume that the conditional distribution of Y is
from a specific distribution family. GAMs nicely estimate the transformation
each continuous X requires so as to optimize a fitting criterion such as sum
of squared errors or log likelihood, subject to the degrees of freedom the
analyst desires to spend on each predictor. However, GAMs assume that Y
has already been transformed to fit the specified distribution family.

There  is  excellent  software  available  for  fitting  a  wide  variety  of  GAMs, 
such  as  the  R  packages  gam,  mgcv,  and  robustgam. 


16.3 Nonparametric Estimation of Y-Transformation

When the model's left-hand side also needs transformation, either to improve
R² or to achieve constant variance of the residuals (which increases the
chances of satisfying a normality assumption), there are a few approaches
available. One approach is Breiman and Friedman's alternating conditional
expectation (ACE) method [68]. ACE simultaneously transforms both Y and
each of the Xs so as to maximize the multiple R² between the transformed
Y and the transformed Xs. The model is given by

$$g(Y) = f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p). \tag{16.1}$$

ACE allows the analyst to impose restrictions on the transformations such
as monotonicity. It allows for categorical predictors, whose categories will
automatically be given numeric scores. The transformation for Y is allowed to
be non-monotonic. One feature of ACE is its ability to estimate the maximal
correlation between an X and the response Y. Unlike the ordinary correlation
coefficient (which assumes linearity) or Spearman's rank correlation (which
assumes monotonicity), the maximal correlation has the property that it is
zero if and only if X and Y are statistically independent. This property holds
because ACE allows for non-monotonic transformations of all variables. The
"super smoother" (see the S supsmu function) is the basis for the nonparametric
estimation of transformations for continuous Xs.

Tibshirani developed a different algorithm for nonparametric additive
regression based on least squares, additivity and variance stabilization
(AVAS).60' Unlike ACE, AVAS forces g(Y) to be monotonic. AVAS's fitting
criterion is to maximize R² while forcing the transformation for Y to
result in nearly constant variance of residuals. The model specification is the
same as for ACE (Equation 16.1).
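The contrast drawn above between ordinary, rank, and maximal correlation can be seen in a small constructed example. The sketch below does not run ACE; it uses only base R and a hand-picked transformation, assuming a deterministic relationship Y = X² on a symmetric grid, to show why a measure restricted to linear or monotonic association can miss complete dependence.

```r
# Symmetric grid of X values; Y is an exact, non-monotonic function of X
x <- (-50:50) / 50
y <- x^2

# Pearson correlation: essentially zero despite perfect dependence
r_pearson <- cor(x, y)

# Spearman rank correlation: also essentially zero, since Y is not monotone in X
r_spearman <- cor(x, y, method = "spearman")

# After a non-monotonic transformation of X (chosen by hand here; ACE would
# estimate such a transformation from the data) the correlation is 1
r_transformed <- cor(x^2, y)

c(pearson = r_pearson, spearman = r_spearman, transformed = r_transformed)
```

In practice the transformation is unknown and must be estimated, which is where the overfitting concerns for ACE and AVAS arise.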

ACE and AVAS are powerful fitting algorithms, but they can result in overfitting
(R² can be greatly inflated when one fits many predictors), and they
provide no statistical inferential measures. As discussed earlier, the process of
estimating transformations (especially those for Y) can result in significant
variance under-estimation, especially for small sample sizes. The bootstrap
can be used to correct the apparent R² (R²_app) for overfitting. As before,
it estimates the optimism (bias) in R²_app and subtracts this optimism from
R²_app to get a more trustworthy estimate. The bootstrap can also be used to
compute confidence limits for all estimated transformations, and confidence
limits for estimated predictor effects that take fully into account the uncertainty
associated with the transformations. To do this, all steps involved in
fitting the additive models must be repeated fresh for each re-sample.
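The optimism correction described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, and the hypothetical helper `fit_model` uses a flexible polynomial least squares fit as a stand-in for ACE or AVAS. With those algorithms, the transformation estimation would be the part repeated inside `fit_model` for every resample.

```r
set.seed(1)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- d$x1 + rnorm(n)            # only x1 is truly related to y

# Hypothetical stand-in for "all steps involved in fitting"
fit_model <- function(dat)
  lm(y ~ poly(x1, 3) + poly(x2, 3) + poly(x3, 3), data = dat)

# R^2 of a fitted model evaluated on an arbitrary data set
r2 <- function(fit, dat) {
  pred <- predict(fit, newdata = dat)
  1 - sum((dat$y - pred)^2) / sum((dat$y - mean(dat$y))^2)
}

full   <- fit_model(d)
r2_app <- r2(full, d)             # apparent R^2 on the data used for fitting

B <- 200
optimism <- numeric(B)
for (b in seq_len(B)) {
  boot <- d[sample(n, replace = TRUE), ]
  fb   <- fit_model(boot)         # refit from scratch on the resample
  # performance on the resample minus performance on the original data
  optimism[b] <- r2(fb, boot) - r2(fb, d)
}

r2_corrected <- r2_app - mean(optimism)
c(apparent = r2_app, optimism = mean(optimism), corrected = r2_corrected)
```

The key design point is that nothing estimated from the full data is reused inside the loop; any data-driven choice left outside `fit_model` would make the optimism estimate too small.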

Limited testing has shown that the sample size needs to exceed 100 for
ACE and AVAS to provide stable estimates. In small sample sizes the bootstrap
bias-corrected estimate of R² will be zero because the sample information
did not support simultaneous estimation of all transformations.


16.4  Obtaining  Estimates  on  the  Original  Scale 

A  common  practice  in  least  squares  fitting  is  to  attempt  to  rectify  lack  of 
fit  by  taking  parametric  transformations  of  Y  before  fitting;  the  logarithm 
is  the  most  common  transformation.21  If  after  transformation  the  model’s 
residuals  have  a  population  median  of  zero,  the  inverse  transformation  of  a 
predicted  transformed  value  estimates  the  population  median  of  Y  given  X. 
This is because, unlike means, quantiles are preserved under monotonic
transformations. Many analysts make the mistake of not reporting which
population parameter is being estimated when inverse transforming $X\hat{\beta}$,
and sometimes they even report that the mean is being estimated.

How would one go about estimating the population mean or other parameter
on the untransformed scale? If the residuals are assumed to be normally
distributed and if log(Y) is the transformation, the mean of the log-normal
distribution, a function of both the mean and the variance of the residuals,
can be used to derive the desired quantity. However, if the residuals are not
normally distributed, this procedure will not result in the correct estimator.
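For reference, the log-normal case can be written out explicitly. The display below assumes, purely for illustration, that the residuals of log(Y) are normal with mean zero and constant variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here and is not notation from the text above); it shows why back-transforming the fitted value gives the median rather than the mean.

```latex
% Assumption: \log(Y) \mid X \sim \mathrm{Normal}(X\beta, \sigma^2)
\begin{align*}
\operatorname{median}(Y \mid X) &= \exp(X\beta), \\
E(Y \mid X) &= \exp\left(X\beta + \tfrac{1}{2}\sigma^{2}\right).
\end{align*}
```

The two quantities differ by the factor $\exp(\sigma^2/2)$, which is always at least 1, so reporting $\exp(X\hat{\beta})$ as a mean understates it whenever there is residual variation.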

a  A  disadvantage  of  transform-both-sides  regression  is  this  difficulty  of  interpreting 
estimates  on  the  original  scale.  Sometimes  the  use  of  a  special  generalized  linear  model 
can  allow  for  a  good  fit  without  transforming  Y . 
Duan165 developed a “smearing” estimator for more nonparametrically
obtaining estimates of parameters on the original scale. In the simple one-sample
case without predictors in which one has computed $\hat{\theta} = \sum_{i=1}^{n} \log(Y_i)/n$, the
residuals from this fitted value are given by $e_i = \log(Y_i) - \hat{\theta}$. The smearing
estimator of the population mean is $\sum_{i=1}^{n} \exp[\hat{\theta} + e_i]/n$. In this simple case the
result is the ordinary sample mean $\bar{Y}$.
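The one-sample identity is easy to check numerically. The sketch below uses a small made-up sample (the values in `y` are hypothetical) and shows that smearing recovers the sample mean exactly, whereas simply exponentiating the mean of the logs gives the geometric mean, which is smaller.

```python
import math

y = [1.2, 3.4, 0.7, 5.9, 2.2]            # hypothetical positive responses
log_y = [math.log(v) for v in y]

theta_hat = sum(log_y) / len(y)           # mean on the log scale
residuals = [v - theta_hat for v in log_y]

# Smearing estimator of the mean: average of back-transformed (fit + residual)
smearing_mean = sum(math.exp(theta_hat + e) for e in residuals) / len(y)
sample_mean = sum(y) / len(y)
naive_back_transform = math.exp(theta_hat)   # geometric mean, not the mean

print(smearing_mean, sample_mean, naive_back_transform)
```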

The worth of Duan’s smearing estimator is in regression modeling. Suppose
that the regression was run on $g(Y)$ from which estimated values
$\hat{g}(Y_i) = X_i\hat{\beta}$ and residuals on the transformed scale $e_i = g(Y_i) - X_i\hat{\beta}$ were
obtained. Instead of restricting ourselves to estimating the population mean, let
$W(y_1, y_2, \ldots, y_n)$ denote any function of a vector of untransformed response
values. To estimate the population mean in the homogeneous one-sample
case, W is the simple average of all of its arguments. To estimate the
population 0.25 quantile, W is the sample 0.25 quantile of $y_1, \ldots, y_n$. Then the
smearing estimator of the population parameter estimated by W given X is
$W(g^{-1}(a + e_1), g^{-1}(a + e_2), \ldots, g^{-1}(a + e_n))$, where $g^{-1}$ is the inverse of the
g transformation and $a = X\hat{\beta}$.
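A minimal sketch of the general smearing estimator follows, assuming a log transformation for g, a single predictor, and made-up data; the function names `fit_line` and `smearing_estimate` are hypothetical. The same residuals are reused with different choices of W to target the mean or a quantile at a given predictor value.

```python
import math
import statistics

def fit_line(x, z):
    """Least squares slope and intercept of z on a single predictor x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    slope = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, mz - slope * mx

def smearing_estimate(a, residuals, g_inverse, W):
    """Apply W to the back-transformed values g^{-1}(a + e_i)."""
    return W([g_inverse(a + e) for e in residuals])

# Hypothetical data; the model is fitted to g(y) = log(y)
x = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5]
y = [1.3, 1.1, 2.0, 1.8, 3.1, 2.6, 4.4, 4.0]
z = [math.log(v) for v in y]
slope, intercept = fit_line(x, z)
residuals = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]

a = intercept + slope * 1.0               # predicted transformed response at x = 1
mean_hat = smearing_estimate(a, residuals, math.exp, statistics.mean)
q25_hat = smearing_estimate(a, residuals, math.exp,
                            lambda v: statistics.quantiles(v, n=4)[0])
naive = math.exp(a)                       # back-transformed fitted value

print(mean_hat, q25_hat, naive)
```

Because least squares residuals average to zero, the smearing mean is never below the back-transformed fitted value when g is the log; the gap grows with the spread of the residuals.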

When using the AVAS algorithm, the monotonic transformation g is
estimated from the data, and the predicted value of $g(Y)$ is given by
Equation 16.3. So we extend the smearing estimator as
$W(g^{-1}(a + e_1), \ldots, g^{-1}(a + e_n))$, where a is the predicted transformed
response given X. As g is nonparametric (i.e., a table look-up), the areg.boot
function described below computes $g^{-1}$ using reverse linear interpolation.
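Reverse linear interpolation of a tabulated monotone transformation can be sketched as below. This is not the areg.boot source code; it is a hypothetical helper (`inverse_by_interpolation`) that assumes the table is strictly increasing and, as a further assumption, clamps targets outside the table to its end points.

```python
from bisect import bisect_left

def inverse_by_interpolation(y_grid, g_values, target):
    """Return y such that g(y) is approximately `target`.

    y_grid   : ascending values of y at which g was tabulated
    g_values : strictly increasing values of g at those points
    """
    if target <= g_values[0]:
        return y_grid[0]                  # clamp below the table (assumption)
    if target >= g_values[-1]:
        return y_grid[-1]                 # clamp above the table (assumption)
    j = bisect_left(g_values, target)     # g_values[j-1] < target <= g_values[j]
    g0, g1 = g_values[j - 1], g_values[j]
    y0, y1 = y_grid[j - 1], y_grid[j]
    return y0 + (y1 - y0) * (target - g0) / (g1 - g0)

# Table of a log2-like transformation, for illustration only
y_grid = [1.0, 2.0, 4.0, 8.0]
g_values = [0.0, 1.0, 2.0, 3.0]
print(inverse_by_interpolation(y_grid, g_values, 1.5))
```

The interpolation swaps the roles of the two table columns: the transformed values are treated as the abscissa and the original y values as the ordinate.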

If residuals from $g(Y)$ are assumed to be symmetrically distributed, their
population median is zero and we can estimate the median on the untransformed
scale by computing $g^{-1}(X\hat{\beta})$. To be safe, areg.boot adds the median
residual to $X\hat{\beta}$ when estimating the population median (the median residual
can be ignored by specifying statistic='fitted' to functions that operate on
objects created by areg.boot).

When quantiles of Y are of major interest, a more direct way to obtain
estimates is through the use of quantile regression. An excellent case study
including comparisons with other methods such as Cox regression can be
found in Austin et al.38.


16.5  R  Functions 

The R acepack package’s ace function implements all the features of the ACE
algorithm, and its avas function does likewise for AVAS. The bootstrap and
smearing capabilities mentioned above are offered for these estimation
functions by the areg.boot (“additive regression using the bootstrap”) function
in the Hmisc package. Unlike the ace and avas functions, areg.boot uses the
R modeling language, making it easier for the analyst to specify the
predictor variables and what is assumed about their relationships with the
transformed Y. areg.boot also implements a parametric transform-both-sides
approach using restricted cubic splines and canonical variates, and offers various
estimation options with and without smearing. It can estimate the effect of
changing one predictor, holding others constant, using the ordinary bootstrap
to estimate the standard deviation of difference in two possibly transformed
estimates (for two values of X), assuming normality of such differences.
Normality is assumed to avoid generating a large number of bootstrap
replications of time-consuming model fits. It would not be very difficult to add
nonparametric bootstrap confidence limit capabilities to the software. areg.boot
re-samples every aspect of the modeling process it uses, just as Faraway186
did for parametric least squares modeling.

areg.boot  implements  a  variety  of  methods  as  shown  in  the  simple  exam¬ 
ple  below.  The  monotone  function  restricts  a  variable’s  transformation  to  be 
monotonic,  while  the  I  function  restricts  it  to  be  linear. 


f <- areg.boot(Y ~ monotone(age) +
               sex + weight + I(blood.pressure))
plot(f)         # show transformations, CLs
Function(f)     # generate S functions
                # defining transformations
predict(f)      # get predictions, smearing estimates
summary(f)      # compute CLs on effects of each X
smearingEst()   # generalized smearing estimators
Mean(f)         # derive S function to
                # compute smearing mean Y
Quantile(f)     # derive function to compute smearing quantile

The  methods  are  best  described  in  a  case  study. 


16.6  Case  Study 

Consider simulated data where the conditional distribution of Y is log-normal
given X, but where transform-both-sides regression methods use unlogged
Y. Predictor $X_1$ is linearly related to log Y, $X_2$ is related by $|X_2 - \frac{1}{2}|$, and
categorical $X_3$ has reference group a effect of zero, group b effect of 0.3, and
group c effect of 0.5.
# For reference fit appropriate OLS model
print(ols(log(y) ~ x1 + rcs(x2, 5) + x3), coefs=FALSE,
      latex=TRUE)


Linear Regression Model

ols(formula = log(y) ~ x1 + rcs(x2, 5) + x3)

                 Model Likelihood       Discrimination
                    Ratio Test             Indexes
Obs      400     LR χ²      236.87      R²        0.447
σ     0.4722     d.f.            7      R² adj    0.437
d.f.     392     Pr(> χ²)   0.0000      g         0.482

Residuals

   Min      1Q  Median    3Q   Max
-1.346 -0.3075 -0.0134 0.327 1.527


Now fit the avas model. We use 300 bootstrap repetitions but only plot
the first 20 estimates to see clearly how the bootstrap re-estimates of
transformations vary. Had we wanted to restrict transformations to be linear, we
would have specified the identity function, for example, I(x1).


avas Additive Regression Model

areg.boot(x = y ~ x1 + x2 + x3, B = 300, method = "avas")

Predictor Types

   type
x1    s
x2    s
x3    c

y type: s

n= 400  p= 3

Apparent R2 on transformed Y scale: 0.444
Bootstrap validated R2            : 0.42

Coefficients of standardized transformations:

    Intercept            x1            x2            x3
-3.443111e-16  9.702960e-01  1.224320e+00  9.881150e-01

Residuals on transformed scale:

          Min            1Q        Median            3Q           Max
-1.877152e+00 -5.252194e-01 -3.732200e-02  5.339122e-01  2.172680e+00
         Mean          S.D.
 8.673617e-19  7.420788e-01

Note that the coefficients above do not mean very much as the scale of the
transformations is arbitrary. We see that the model was very slightly
overfitted ($R^2$ dropped from 0.44 to 0.42), and the $R^2$ values are in agreement
with the OLS model fit above.

Next  we  plot  the  transformations,  0.95  confidence  bands,  and  a  sample  of 
the  bootstrap  estimates. 

plot(f, boot=20)   # Figure 16.1


[Figure 16.1: one panel per transformed variable; recoverable panel labels are x2 and x3]

Fig. 16.1 avas transformations: overall estimates, pointwise 0.95 confidence bands,
and 20 bootstrap estimates (red lines).


The plot is shown in Figure 16.1. The nonparametrically estimated
transformation of x1 is almost linear, and the transformation of x2 is close to $|x2 - 0.5|$.
We know that the true transformation of y is log(y), so variance stabilization
and normality of residuals will be achieved if the estimated y-transformation
is close to log(y).
ys <- seq(.8, 20, length=200)
ytrans <- Function(f)$y   # Function outputs all transforms
plot(log(ys), ytrans(ys), type='l')   # Figure 16.2
abline(lm(ytrans(ys) ~ log(ys)), col=gray(.8))


[Figure 16.2: horizontal axis log(ys), from 0.0 to 3.0]

Fig. 16.2 Checking estimated against optimal transformation


Approximate linearity indicates that the estimated transformation is very
log-like (see footnote b).

Now let us obtain approximate tests of effects of each predictor. summary
does this by setting all other predictors to reference values (e.g., medians),
and comparing predicted responses for a given level of the predictor X with
predictions for the lowest setting of X. The default predicted response for
summary is the median, which is used here. Therefore tests are for differences
in medians.

summary(f, values=list(x1=c(.2, .8), x2=c(.1, .5)))

summary.areg.boot(object = f, values = list(x1 = c(0.2, 0.8),
    x2 = c(0.1, 0.5)))

Estimates based on 300 resamples

Values to which predictors are set when estimating
effects of other predictors:

       y       x1       x2       x3
3.728843 0.500000 0.300000 2.000000


b Beware of using a data-derived transformation in an ordinary model, as this will
result in standard errors that are too small. This is because model selection is not
taken into account.186
Estimates of differences of effects on Median Y (from first X
value), and bootstrap standard errors of these differences.
Settings for X are shown as row headings.

Predictor: x1

    x Differences       S.E Lower 0.95 Upper 0.95        Z      Pr(|Z|)
  0.2    0.000000        NA         NA         NA       NA           NA
  0.8    1.546992 0.2099959   1.135408   1.958577 7.366773 1.747491e-13

Predictor: x2

    x Differences       S.E Lower 0.95 Upper 0.95         Z      Pr(|Z|)
  0.1    0.000000        NA         NA         NA        NA           NA
  0.5   -1.658961 0.3163361  -2.278968  -1.038953 -5.244298 1.568786e-07

Predictor: x3

    x Differences       S.E Lower 0.95 Upper 0.95        Z      Pr(|Z|)
    a   0.0000000        NA         NA         NA       NA           NA
    b   0.8447422 0.1768244  0.4981728   1.191312 4.777295 1.776692e-06
    c   1.3526151 0.2206395  0.9201697   1.785061 6.130431 8.764127e-10
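The confidence limits, Z statistic, and P-value in each row appear to follow from the difference and its bootstrap standard error under the normality assumption mentioned in Section 16.5. The sketch below (the function name `wald_summary` is hypothetical, and this is a re-derivation from the printed numbers rather than the Hmisc code) reproduces the x1 row to the printed precision.

```python
import math
from statistics import NormalDist

def wald_summary(difference, std_error, level=0.95):
    """Normal-theory limits, Z, and two-sided P-value from an estimate and its SE."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    z = difference / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail area
    return {
        "lower": difference - z_crit * std_error,
        "upper": difference + z_crit * std_error,
        "z": z,
        "p": p_value,
    }

# Difference and bootstrap S.E. for x1 = 0.8 versus 0.2, copied from the output
x1_effect = wald_summary(1.546992, 0.2099959)
print(x1_effect)
```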

For example, when x1 increases from 0.2 to 0.8 we predict an increase in
median y by 1.55 with bootstrap standard error 0.21, when all other predictors
are held to constants. Setting them to other constants will yield different
estimates of the x1 effect, as the transformation of y is nonlinear.
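The dependence on the reference settings can be seen directly from the population model used to simulate the data. The sketch below assumes the linear predictor given in the case study code (x1 plus 2|x2 - 0.5| plus group effects of 0.3 and 0.5 on the log scale) and uses a hypothetical helper `true_median`. Differences in median y change with the other predictors, while ratios of medians do not.

```python
import math

GROUP_EFFECT = {"a": 0.0, "b": 0.3, "c": 0.5}   # effects on the log(y) scale

def true_median(x1, x2, x3):
    """Population median of y: exp of the linear predictor for log(y)."""
    lp = x1 + 2.0 * abs(x2 - 0.5) + GROUP_EFFECT[x3]
    return math.exp(lp)

# Effect of moving x1 from 0.2 to 0.8 at two different settings of x2 and x3
diff_ref = true_median(0.8, 0.3, "b") - true_median(0.2, 0.3, "b")
diff_other = true_median(0.8, 0.5, "a") - true_median(0.2, 0.5, "a")
ratio_ref = true_median(0.8, 0.3, "b") / true_median(0.2, 0.3, "b")
ratio_other = true_median(0.8, 0.5, "a") / true_median(0.2, 0.5, "a")

print(diff_ref, diff_other, ratio_ref, ratio_other)
```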


Next depict the fitted model by plotting predicted values, with x2 varying
on the x-axis, and three curves corresponding to three values of x3. The variable
x1 is set to 0.5. Figure 16.3 shows estimates of both the median and the mean y.


newdat <- expand.grid(x2=seq(.05, .95, length=200),
                      x3=c('a','b','c'), x1=.5,
                      statistic=c('median', 'mean'))
yhat <- c(predict(f, subset(newdat, statistic=='median'),
                  statistic='median'),
          predict(f, subset(newdat, statistic=='mean'),
                  statistic='mean'))
newdat <-
  upData(newdat,
         lp = x1 + 2*abs(x2 - .5) + .3*(x3=='b') +
              .5*(x3=='c'),
         ytrue = ifelse(statistic=='median', exp(lp),
                        exp(lp + 0.5*(0.5^2))),
         pr=FALSE)

Input object size:   45472 bytes;   4 variables
Added variable       lp
Added variable       ytrue
Added variable       pr
New object size:     69800 bytes;   7 variables


# Use Hmisc function xYplot to produce Figure 16.3
xYplot(yhat ~ x2 | statistic, groups=x3,
       data=newdat, type='l', col=1,
       ylab=expression(hat(y)),
       panel=function(...) {
         panel.xYplot(...)
         dat <- subset(newdat,
                       statistic==c('median','mean')[current.column()])
         for(w in c('a','b','c'))
           with(subset(dat, x3==w),
                llines(x2, ytrue, col=gray(.7), lwd=1.5))
       })


[Figure 16.3: horizontal axis x2]

Fig. 16.3 Predicted median (left panel) and mean (right panel) y as a function of
x2 and x3. True population values are shown in gray.


Chapter  17 

Introduction  to  Survival  Analysis 


17.1  Background 


Suppose that one wished to study the occurrence of some event in a
population of subjects. If the time until the occurrence of the event were
unimportant, the event could be analyzed as a binary outcome using the logistic
regression model. For example, in analyzing mortality associated with open
heart surgery, it may not matter whether a patient dies during the
procedure or he dies after being in a coma for two months. For other outcomes,
especially  those  concerned  with  chronic  conditions,  the  time  until  the  event 
is  important.  In  a  study  of  emphysema,  death  at  eight  years  after  onset  of 
symptoms  is  different  from  death  at  six  months.  An  analysis  that  simply 
counted  the  number  of  deaths  would  be  discarding  valuable  information  and 
sacrificing  statistical  power. 

Survival  analysis  is  used  to  analyze  data  in  which  the  time  until  the  event 
is  of  interest.  The  response  variable  is  the  time  until  that  event  and  is  often 
called  a  failure  time ,  survival  time ,  or  event  time.  Examples  of  responses 
of  interest  include  the  time  until  cardiovascular  death,  time  until  death  or 
myocardial  infarction,  time  until  failure  of  a  light  bulb,  time  until  pregnancy, 
or  time  until  occurrence  of  an  ECG  abnormality  during  exercise.  Bull  and 
Spiegelhalter8  have  an  excellent  overview  of  survival  analysis. 

The response, event time, is usually continuous, but survival analysis
allows the response to be incompletely determined for some subjects. For
example, suppose that after a five-year follow-up study of survival after myocardial
infarction  a  patient  is  still  alive.  That  patient’s  survival  time  is  censored  on 
the  right  at  five  years;  that  is,  her  survival  time  is  known  only  to  exceed  five 
years.  The  response  value  to  be  used  in  the  analysis  is  5+.  Censoring  can  also 
occur  when  a  subject  is  lost  to  follow-up. 

If no responses are censored, standard regression models for continuous
responses could be used to analyze the failure times by writing the
expected failure time as a function of one or more predictors, assuming that
the distribution of failure time is properly specified. However, there are still
several  reasons  for  studying  failure  time  using  the  specialized  methods  of 
survival  analysis. 

1.  Time  to  failure  can  have  an  unusual  distribution.  Failure  time  is  restricted 
to  be  positive  so  it  has  a  skewed  distribution  and  will  never  be  normally 
distributed. 

2.  The  probability  of  surviving  past  a  certain  time  is  often  more  relevant  than 
the  expected  survival  time  (and  expected  survival  time  may  be  difficult  to 
estimate  if  the  amount  of  censoring  is  large). 

3.  A  function  used  in  survival  analysis,  the  hazard  function,  helps  one  to 
understand  the  mechanism  of  failure.308 

Survival  analysis  is  used  often  in  industrial  life-testing  experiments,  and 
it  is  heavily  used  in  clinical  and  epidemiologic  follow-up  studies.  Examples 
include  a  randomized  trial  comparing  a  new  drug  with  placebo  for  its  ability 
to  maintain  remission  in  patients  with  leukemia,  and  an  observational  study 
of  prognostic  factors  in  coronary  heart  disease.  In  the  latter  example  subjects 
may  well  be  followed  for  varying  lengths  of  time,  as  they  may  enter  the  study 
over  a  period  of  many  years. 

When  regression  models  are  used  for  survival  analysis,  all  the  advantages 
of  these  models  can  be  brought  to  bear  in  analyzing  failure  times.  Multiple, 
independent  prognostic  factors  can  be  analyzed  simultaneously  and  treatment 
differences  can  be  assessed  while  adjusting  for  heterogeneity  and  imbalances 
in baseline characteristics. Also, patterns in outcome over time can be
predicted for individual subjects.

Even  in  a  simple  well-designed  experiment,  survival  modeling  can  allow 
one  to  do  the  following  in  addition  to  making  simple  comparisons. 

1.  Test  for  and  describe  interactions  with  treatment.  Subgroup  analyses  can 
easily generate spurious results and they do not consider interacting
factors in a dose-response manner. Once interactions are modeled, relative
treatment  benefits  can  be  estimated  (e.g.,  hazard  ratios),  and  analyses 
can  be  done  to  determine  if  some  patients  are  too  sick  or  too  well  to  have 
even  a  relative  benefit. 

2.  Understand  prognostic  factors  (strength  and  shape). 

3.  Model  absolute  effect  of  treatment.  First,  a  model  for  the  probability  of 
surviving  past  time  t  is  developed.  Then  differences  in  survival  probabilities 
for  patients  on  treatments  A  and  B  can  be  estimated.  The  differences  will 
be  due  primarily  to  sickness  (overall  risk)  of  the  patient  and  to  treatment 
interactions. 

4.  Understand  time  course  of  treatment  effect.  The  period  of  maximum  effect 
or  period  of  any  substantial  effect  can  be  estimated  from  a  plot  of  relative 
effects  of  treatment  over  time. 

5.  Gain  power  for  testing  treatment  effects. 

6.  Adjust  for  imbalances  in  treatment  allocation  in  non-randomized  studies. 
Responses may be left-censored and interval-censored besides being
right-censored. Interval-censoring is present, for example, when a measuring device
functions only for a certain range of the response; measurements outside that
range are censored at an end of the scale of the device. Interval-censoring also
occurs when the presence of a medical condition is assessed during periodic
exams. When the condition is present, the time until the condition developed is
only known to be between the current and the previous exam. Left-censoring
means that an event is known to have occurred before a certain time. In
addition, left-truncation and delayed entry are common. Nomenclature is
confusing as many authors refer to delayed entry as left-truncation. Left-truncation
really means that an unknown subset of subjects failed before a certain time
and the subjects didn’t get into the study. For example, one might study the
survival patterns of patients who were admitted to a tertiary care hospital.
Patients who didn’t survive long enough to be referred to the hospital
compose the left-truncated group, and interesting questions such as the optimum
timing of admission to the hospital cannot be answered from the data set.

Delayed  entry  occurs  in  follow-up  studies  when  subjects  are  exposed  to  the 
risk  of  interest  only  after  varying  periods  of  survival.  For  example,  in  a  study 
of  occupational  exposure  to  a  toxic  compound,  researchers  may  be  interested 
in  comparing  life  length  of  employees  with  life  expectancy  in  the  general 
population.  A  subject  must  live  until  the  beginning  of  employment  before 
exposure  is  possible;  that  is,  death  cannot  be  observed  before  employment. 
The  start  of  follow-up  is  delayed  until  the  start  of  employment  and  it  may  be 
right-censored  when  follow-up  ends.  In  some  studies,  a  researcher  may  want 
to  assume  that  for  the  purpose  of  modeling  the  shape  of  the  hazard  function, 
time  zero  is  the  day  of  diagnosis  of  disease,  while  patients  enter  the  study 
at  various  times  since  diagnosis.  Delayed  entry  occurs  for  patients  who  don’t 
enter  the  study  until  some  time  after  their  diagnosis.  Patients  who  die  before 
study  entry  are  left-truncated.  Note  that  the  choice  of  time  origin  is  very 
important.53, 83, 112, 133 

Heart  transplant  studies  have  been  analyzed  by  considering  time  zero  to  be 
the  time  of  enrollment  in  the  study.  Pre-transplant  survival  is  right-censored 
at  the  time  of  transplant.  Transplant  survival  experience  is  based  on  delayed 
entry  into  the  “risk  set”  to  recognize  that  a  transplant  patient  is  not  at  risk 
of  dying  from  transplant  failure  until  after  a  donor  heart  is  found.  In  other 
words,  survival  experience  is  not  credited  to  transplant  surgery  until  the  day 
of  transplant.  Comparisons  of  transplant  experience  with  medical  treatment 
suffer  from  “waiting  time  bias”  if  transplant  survival  begins  on  the  day  of 
transplant  instead  of  using  delayed  entry.209, 438,570 

There are several planned mechanisms by which a response is
right-censored. Fixed type I censoring occurs when a study is planned to end
after two years of follow-up, or when a measuring device will only measure
responses up to a certain limit. There the responses are observed only if they
fall below a fixed value C. In type II censoring, a study ends when there is
a pre-specified number of events. If, for example, 100 mice are followed until
50 die, the censoring time is not known in advance.

We are concerned primarily with random type I right-censoring in which
each subject’s event time is observed only if the event occurs before a certain
time, but the censoring time can vary between subjects. Whatever the cause
of censoring, we assume that the censoring is non-informative about the event;
that is, the censoring is caused by something that is independent of the
impending failure. Censoring is non-informative when it is caused by planned
termination of follow-up or by a subject moving out of town for reasons
unrelated to the risk of the event. If subjects are removed from follow-up because
of a worsening condition, the informative censoring will result in biased
estimates and inaccurate statistical inference about the survival experience. For
example, if a patient’s response is censored because of an adverse effect of
a drug or noncompliance to the drug, a serious bias can result if patients
with adverse experiences or noncompliance are also at higher risk of suffering
the outcome. In such studies, efficacy can only be assessed fairly using the
intention to treat principle: all events should be attributed to the treatment
assigned even if the subject is later removed from that treatment.


17.3  Notation,  Survival,  and  Hazard  Functions 

In survival analysis we use T to denote the response variable, as the response
is usually the time until an event. Instead of defining the statistical model
for the response T in terms of the expected failure time, it is advantageous
to define it in terms of the survival function, S(t), given by

$$S(t) = \mathrm{Prob}\{T > t\} = 1 - F(t), \tag{17.1}$$

where F(t) is the cumulative distribution function for T. If the event is death,
S(t) is the probability that death occurs after time t, that is, the probability
that the subject will survive at least until time t. S(t) is always 1 at t = 0;
all subjects survive at least to time zero. The survival function must be
non-increasing as t increases. An example of a survival function is shown in
Figure 17.1. In that example subjects are at very high risk of the event in the
early period so that S(t) drops sharply. The risk is low for 0.1 < t < 0.6, so
S(t) is somewhat flat. After t = 0.6 the risk again increases, so S(t) drops more
quickly. Figure 17.2 depicts the cumulative hazard function corresponding
to the survival function in Figure 17.1. This function is denoted by $\Lambda(t)$.
It describes the accumulated risk up until time t, and as is shown later,
is the negative of the log of the survival function. $\Lambda(t)$ is non-decreasing
as t increases; that is, the accumulated risk increases or remains the same.
Another important function is the hazard function, $\lambda(t)$, also called the force
of mortality, or instantaneous event (death, failure) rate. The hazard at time
t is related to the probability that the event will occur in a small interval
around t, given that the event has not occurred before time t. By studying
the event rate at a given time conditional on the event not having occurred by
that time, one can learn about the mechanisms and forces of risk over time.
Figure 17.3 depicts the hazard function corresponding to S(t) in Figure 17.1
and to $\Lambda(t)$ in Figure 17.2. Notice that the hazard function allows one to
more easily determine the phases of increased risk than looking for sudden
drops in S(t) or $\Lambda(t)$.

Fig. 17.1 Survival function

Fig. 17.2 Cumulative hazard function

Fig. 17.3 Hazard function

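The hazard function is defined formally by

$$\lambda(t) = \lim_{u \to 0} \frac{\mathrm{Prob}\{t < T \leq t + u \mid T > t\}}{u}, \tag{17.2}$$

which using the law of conditional probability becomes

$$\begin{aligned}
\lambda(t) &= \lim_{u \to 0} \frac{\mathrm{Prob}\{t < T \leq t + u\}/\mathrm{Prob}\{T > t\}}{u} \\
&= \lim_{u \to 0} \frac{[F(t + u) - F(t)]/u}{S(t)} \\
&= \frac{dF(t)/dt}{S(t)} = \frac{f(t)}{S(t)},
\end{aligned} \tag{17.3}$$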
where f(t) is the probability density function of T evaluated at t, the
derivative or slope of the cumulative distribution function 1 - S(t). Since

$$\frac{d \log S(t)}{dt} = \frac{dS(t)/dt}{S(t)} = -\frac{f(t)}{S(t)}, \tag{17.4}$$

the hazard function can also be expressed as

$$\lambda(t) = -\frac{d \log S(t)}{dt}, \tag{17.5}$$

the negative of the slope of the log of the survival function. Working
backwards, the integral of $\lambda(t)$ is:

$$\int_0^t \lambda(v)\,dv = -\log S(t). \tag{17.6}$$

The integral or area under $\lambda(t)$ is defined to be $\Lambda(t)$, the cumulative hazard
function. Therefore

$$\Lambda(t) = -\log S(t), \tag{17.7}$$

or

$$S(t) = \exp[-\Lambda(t)]. \tag{17.8}$$

So knowing any one of the functions S(t), $\Lambda(t)$, or $\lambda(t)$ allows one to derive
the other two functions. The three functions are different ways of describing
the same distribution.
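The relationships among the three functions can be checked numerically. The sketch below starts from an assumed hazard that rises linearly with time (hazard 2t, chosen only for illustration; it is not a distribution used in the text), integrates it to get the cumulative hazard, exponentiates to get the survival function, and then recovers the hazard as the negative slope of log survival.

```python
import math

def hazard(t):
    """Assumed example hazard: risk rising linearly with time."""
    return 2.0 * t

def cum_hazard_numeric(t, steps=1000):
    """Trapezoidal integral of the hazard from 0 to t."""
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * h) for i in range(1, steps))
    return total * h

def survival(t):
    """Survival function as exp(-cumulative hazard)."""
    return math.exp(-cum_hazard_numeric(t))

def density(t):
    """Density as hazard times survival."""
    return hazard(t) * survival(t)

t, h = 1.5, 1e-4
cum_haz = cum_hazard_numeric(t)
surv = survival(t)
# Negative slope of log S(t) by central differences should return the hazard
neg_slope = -(math.log(survival(t + h)) - math.log(survival(t - h))) / (2 * h)

print(cum_haz, surv, neg_slope, density(t))
```

For this hazard the cumulative hazard is t squared, so the printed values can be compared with the closed forms 2.25 and exp(-2.25).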

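One property of $\Lambda(t)$ is that the expected value of $\Lambda(T)$ is unity, since if
T ~ S(t), the density of T is $\lambda(t)S(t)$ and, substituting $u = \Lambda(t)$ so that
$du = \lambda(t)\,dt$,

$$\begin{aligned}
E[\Lambda(T)] &= \int_0^\infty \Lambda(t)\lambda(t)\exp(-\Lambda(t))\,dt \\
&= \int_0^\infty u\exp(-u)\,du \\
&= 1.
\end{aligned} \tag{17.9}$$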


Now consider properties of the distribution of T. The population qth
quantile (100qth percentile), $T_q$, is the time by which a fraction q of the subjects
will fail. It is the value t such that S(t) = 1 - q; that is

$$T_q = S^{-1}(1 - q). \tag{17.10}$$

The median life length is the time by which half the subjects will fail, obtained
by setting S(t) = 0.5:

$$T_{0.5} = S^{-1}(0.5). \tag{17.11}$$

The qth quantile of T can also be computed by setting $\exp[-\Lambda(t)] = 1 - q$,
giving

$$T_q = \Lambda^{-1}[-\log(1 - q)] \quad \text{and as a special case,} \quad T_{0.5} = \Lambda^{-1}(\log 2). \tag{17.12}$$

The mean or expected value of T (the expected failure time) is the area under
the survival function for t ranging from 0 to $\infty$:

$$\mu = \int_0^\infty S(v)\,dv. \tag{17.13}$$

Irwin  has  defined  mean  restricted  life  (see  [334,335]),  which  is  the  area  under 
S(t)  up  to  a  fixed  time  (usually  chosen  to  be  a  point  at  which  there  is  still 
adequate  follow-up  information). 
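Quantiles, the mean, and mean restricted life can all be computed from the survival or cumulative hazard function. The sketch below assumes, for illustration only, a cumulative hazard equal to t squared (so its inverse is a square root); the function names are hypothetical. The mean is approximated by integrating the survival function out to a time at which it is effectively zero.

```python
import math

def cum_hazard(t):
    """Assumed example cumulative hazard."""
    return t * t

def inverse_cum_hazard(u):
    """Inverse of the assumed cumulative hazard."""
    return math.sqrt(u)

def survival(t):
    return math.exp(-cum_hazard(t))

def quantile(q):
    """q-th quantile of T: inverse cumulative hazard of -log(1 - q)."""
    return inverse_cum_hazard(-math.log(1.0 - q))

def restricted_mean(tau, steps=20000):
    """Area under the survival function from 0 to tau (trapezoidal rule)."""
    h = tau / steps
    total = 0.5 * (survival(0.0) + survival(tau))
    total += sum(survival(i * h) for i in range(1, steps))
    return total * h

median = quantile(0.5)
mean_life = restricted_mean(10.0)     # survival is negligible beyond t = 10 here
mean_to_one = restricted_mean(1.0)    # mean restricted life up to t = 1

print(median, mean_life, mean_to_one)
```

Mean restricted life is never larger than the full mean, and it can be estimated even when follow-up is too short to see the tail of the distribution.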

The random variable T denotes a random failure time from the survival
distribution S(t). We need additional notation for the response and censoring
information for the ith subject. Let $T_i$ denote the response for the ith subject.
This response is the time until the event of interest, and it may be censored
if the subject is not followed long enough for the event to be observed. Let $C_i$
denote the censoring time for the ith subject, and define the event indicator as

$$e_i = \begin{cases} 1 & \text{if the event was observed } (T_i \leq C_i), \\ 0 & \text{if the response was censored } (T_i > C_i). \end{cases} \tag{17.14}$$

The observed response is

$$Y_i = \min(T_i, C_i), \tag{17.15}$$

which is the time that occurred first: the failure time or the censoring time.
The pair of values $(Y_i, e_i)$ contains all the response information for most
purposes (i.e., the potential censoring time $C_i$ is not usually of interest if the
event occurred before $C_i$).
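The construction of the observed response and the event indicator can be shown with a few made-up subjects. In real data the failure time of a censored subject is of course unknown; the hypothetical `failure_times` below exist only so that the definitions can be applied mechanically.

```python
# Hypothetical latent failure times and censoring times for four subjects
failure_times = [2.5, 7.0, 4.1, 9.3]
censor_times = [5.0, 5.0, 3.0, 12.0]

# Observed response: whichever came first, failure or censoring
observed = [min(t, c) for t, c in zip(failure_times, censor_times)]
# Event indicator: 1 if the failure was observed, 0 if the response was censored
event = [1 if t <= c else 0 for t, c in zip(failure_times, censor_times)]

for y_i, e_i in zip(observed, event):
    # Censored responses are conventionally written with a trailing plus sign
    print(f"{y_i}{'' if e_i else '+'}")
```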

Figure  17.4  demonstrates  this  notation.  The  line  segments  start  at  study 
entry  (survival  time  t  =  0). 

Fig. 17.4 Some censored data. Circles denote events. [Figure annotation: Termination of Study]

A useful property of the cumulative hazard function can be derived as follows.
Let $z$ be any cutoff time and consider the expected value of $\Lambda$ evaluated
at the earlier of the cutoff time or the actual failure time.

$$\begin{aligned} E[\Lambda(\min(T, z))] &= E[\Lambda(T)[T \le z] + \Lambda(z)[T > z]] \\ &= E[\Lambda(T)[T \le z]] + \Lambda(z)S(z). \end{aligned} \tag{17.16}$$

The first term in the right-hand side is

$$\begin{aligned} \int_0^\infty \Lambda(t)[t \le z]\lambda(t)\exp(-\Lambda(t))\,dt &= \int_0^z \Lambda(t)\lambda(t)\exp(-\Lambda(t))\,dt \\ &= -[u\exp(-u) + \exp(-u)]\Big|_0^{\Lambda(z)} \\ &= 1 - S(z)[\Lambda(z) + 1]. \end{aligned} \tag{17.17}$$

The second equality uses the substitution $u = \Lambda(t)$, $du = \lambda(t)\,dt$.
Adding $\Lambda(z)S(z)$ results in

$$E[\Lambda(\min(T, z))] = 1 - S(z) = F(z). \tag{17.18}$$

It follows that $\sum_{i=1}^{n} \Lambda(\min(T_i, z))$ estimates the expected number of failures
occurring before time $z$ among the $n$ subjects.
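
The identity in Equation 17.18 can be checked numerically. The sketch below assumes a Weibull-type cumulative hazard with hypothetical parameters `alpha` and `gamma` and a hypothetical cutoff `z`; it is an illustration under those assumptions, not part of the derivation.

```r
# Hypothetical parameters for a cumulative hazard of the form alpha * t^gamma
alpha <- 0.5
gamma <- 1.5

Lambda <- function(t) alpha * t^gamma                 # cumulative hazard
lambda <- function(t) alpha * gamma * t^(gamma - 1)   # hazard
S      <- function(t) exp(-Lambda(t))                 # survival function
dens   <- function(t) lambda(t) * S(t)                # density f(t) = lambda(t) S(t)

z <- 2   # hypothetical cutoff time

# Left side: E[Lambda(T) [T <= z]] + Lambda(z) S(z), by numerical integration
lhs <- integrate(function(t) Lambda(t) * dens(t), lower = 0, upper = z)$value +
  Lambda(z) * S(z)

# Right side: F(z) = 1 - S(z)
rhs <- 1 - S(z)

c(lhs = lhs, rhs = rhs)
```

The two numbers should agree up to the tolerance of `integrate`.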


17.4  Homogeneous  Failure  Time  Distributions 

In this section we assume that each subject in the sample has the same
distribution of the random variable $T$ that represents the time until the event.
In particular, there are no covariables that describe differences between
subjects in the distribution of $T$. As before we use $S(t)$, $\lambda(t)$, and $\Lambda(t)$ to denote,
respectively, the survival, hazard, and cumulative hazard functions.

The form of the true population survival distribution function $S(t)$ is
almost always unknown, and many distributional forms have been used for
describing failure time data. We consider first the two most popular
parametric survival distributions: the exponential and Weibull distributions. The
exponential distribution is a very simple one in which the hazard function is
constant; that is, $\lambda(t) = \lambda$. The cumulative hazard and survival functions
are then

$$\Lambda(t) = \lambda t \quad \text{and} \quad S(t) = \exp(-\Lambda(t)) = \exp(-\lambda t). \tag{17.19}$$

The median life length is $\Lambda^{-1}(\log 2)$ or

$$T_{0.5} = \log(2)/\lambda. \tag{17.20}$$

The time by which 1/2 of the subjects will have failed is then proportional to
the reciprocal of the constant hazard rate $\lambda$. This is true also of the expected
or mean life length, which is $1/\lambda$.

The  exponential  distribution  is  one  of  the  few  distributions  for  which  a 
closed-form  solution  exists  for  the  estimator  of  its  parameter  when  censoring 
is  present.  This  estimator  is  a  function  of  the  number  of  events  and  the  total 
person-years  of  exposure.  Methods  based  on  person-years  in  fact  implicitly 
assume  an  exponential  distribution.  The  exponential  distribution  is  often  used 
to  model  events  that  occur  “at  random  in  time.”32"  It  has  the  property  that 
the  future  lifetime  of  a  subject  is  the  same,  no  matter  how  “old”  it  is,  or 


$$\text{Prob}\{T > t_0 + t \mid T > t_0\} = \text{Prob}\{T > t\}. \tag{17.21}$$


This  “ageless”  property  also  makes  the  exponential  distribution  a  poor  choice 
for  modeling  human  survival  except  over  short  time  periods. 
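
The closed-form estimator mentioned above is commonly written as the number of events divided by the total follow-up time. The sketch below uses that form with hypothetical follow-up times `y` (in years) and event indicators `e`, then checks the "ageless" property of Equation 17.21 at arbitrary illustrative times `t0` and `t`.

```r
# Hypothetical follow-up times (years) and event indicators (1 = event, 0 = censored)
y <- c(2.0, 3.5, 1.2, 4.0, 5.0, 0.8)
e <- c(1,   0,   1,   1,   0,   1)

# Events per person-year of exposure
lambda_hat <- sum(e) / sum(y)

median_hat <- log(2) / lambda_hat   # Eq. 17.20
mean_hat   <- 1 / lambda_hat        # mean life length

# Ageless property (Eq. 17.21): Prob{T > t0 + t | T > t0} equals Prob{T > t}
t0 <- 3
t  <- 2
cond   <- pexp(t0 + t, rate = lambda_hat, lower.tail = FALSE) /
          pexp(t0,     rate = lambda_hat, lower.tail = FALSE)
uncond <- pexp(t, rate = lambda_hat, lower.tail = FALSE)

c(lambda_hat = lambda_hat, median = median_hat, mean = mean_hat,
  cond = cond, uncond = uncond)
```

Censored subjects contribute exposure time to the denominator but no event to the numerator.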

The Weibull distribution is a generalization of the exponential distribution.
Its hazard, cumulative hazard, and survival functions are given by

$$\begin{aligned} \lambda(t) &= \alpha\gamma t^{\gamma-1} \\ \Lambda(t) &= \alpha t^{\gamma} \\ S(t) &= \exp(-\alpha t^{\gamma}). \end{aligned} \tag{17.22}$$

The Weibull distribution with $\gamma = 1$ is an exponential distribution (with
constant hazard). When $\gamma > 1$, its hazard is increasing with $t$, and when
$\gamma < 1$ its hazard is decreasing. Figure 17.5 depicts some of the shapes of
the hazard function that are possible. If $T$ has a Weibull distribution, the
median of $T$ is

$$T_{0.5} = [(\log 2)/\alpha]^{1/\gamma}. \tag{17.23}$$
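
A short sketch of the Weibull functions in this parameterization follows. The values of `alpha` and `gamma` are hypothetical. Base R's `pweibull` uses a shape and scale parameterization, so the sketch converts with `shape = gamma` and `scale = alpha^(-1/gamma)`.

```r
# Hypothetical Weibull parameters in the (alpha, gamma) parameterization
alpha <- 1
gamma <- 2

haz    <- function(t) alpha * gamma * t^(gamma - 1)   # hazard
cumhaz <- function(t) alpha * t^gamma                  # cumulative hazard
surv   <- function(t) exp(-cumhaz(t))                  # survival function

# Median from Eq. 17.23; the survival function should equal 0.5 there
med <- (log(2) / alpha)^(1 / gamma)
surv(med)

# Same survival probability via base R's shape/scale parameterization
t_check <- 1.3
surv(t_check)
pweibull(t_check, shape = gamma, scale = alpha^(-1 / gamma), lower.tail = FALSE)
```

With `gamma > 1` the hazard returned by `haz` increases in `t`, matching the description above.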


Fig. 17.5 Some Weibull hazard functions with $\alpha = 1$ and various values of $\gamma$. [Horizontal axis: $t$]

There are many other traditional parametric survival distributions, some of
which have hazards that are "bathtub shaped" as in Figure 17.3 [243,323]. The
restricted cubic spline function described in Section 2.4.5 is an alternative
basis for $\lambda(t)$ [286,287]. This function family allows for any shape of smooth
$\lambda(t)$ since the number of knots can be increased as needed, subject to the
number of events in the sample. Nonlinear terms in the spline function can
be tested to assess linearity of hazard (Rayleigh-ness) or constant hazard
(exponentiality).

The restricted cubic spline hazard model with $k$ knots is

$$\lambda_k(t) = a + bt + \sum_{j=1}^{k-2} \gamma_j w_j(t), \tag{17.24}$$

where the $w_j(t)$ are the restricted cubic spline terms of Equation 2.25. These
terms are cubic terms in $t$. A set of knots $v_1, \ldots, v_k$ is selected from the
quantiles of the uncensored failure times (see Section 2.4.5 and [286]).

The cumulative hazard function for this model is

$$\Lambda(t) = at + \frac{1}{2}bt^2 + \frac{1}{4} \times \text{quartic terms in } t. \tag{17.25}$$

Standard maximum likelihood theory is used to obtain estimates of the $k$
unknown parameters to derive, for example, smooth estimates of $\lambda(t)$ with
confidence bands. The flexible estimates of $S(t)$ using this method are as
efficient as Kaplan-Meier estimates, but they are smooth and can be used as a
basis for modeling predictor variables. The spline hazard model is particularly
useful for fitting steeply falling and gently rising hazard functions that are
characteristic of high-risk medical procedures.


17.5  Nonparametric  Estimation  of  S  and  A 
17.5.1  Kaplan-Meier  Estimator 

As the true form of the survival distribution is seldom known, it is useful to
estimate the distribution without making any assumptions. For many
analyses, this may be the last step, while in others this step helps one select a
statistical model for more in-depth analyses. When no event times are
censored, a nonparametric estimator of $S(t)$ is $1 - F_n(t)$ where $F_n(t)$ is the usual
empirical cumulative distribution function based on the observed failure times
$T_1, \ldots, T_n$. Let $S_n(t)$ denote this empirical survival function. $S_n(t)$ is given
by the fraction of observed failure times that exceed $t$:

$$S_n(t) = [\text{number of } T_i > t]/n. \tag{17.26}$$

Table 17.1 Kaplan-Meier computations

Day   No. Subjects   Deaths   Censored   Cumulative
      At Risk                            Survival
12    100            1        0          99/100 = .99
30     99            2        1          97/99 × 99/100 = .97
60     96            0        3          96/96 × .97 = .97
72     93            3        0          90/93 × .97 = .94

When censoring is present, $S(t)$ can be estimated (at least for $t$ up until
the end of follow-up) by the Kaplan-Meier [333] product-limit estimator. This
method is based on conditional probabilities. For example, suppose that
every subject has been followed for 39 days or has died within 39 days so that
the proportion of subjects surviving at least 39 days can be computed. After
39 days, some subjects may be lost to follow-up besides those removed from
follow-up because of death within 39 days. The proportion of those still
followed 39 days who survive day 40 is computed. The probability of surviving
40 days from study entry equals the probability of surviving day 40 after
living 39 days, multiplied by the chance of surviving 39 days.

The  life  table  in  Table  17.1  demonstrates  the  method  in  more  detail.  We 
suppose  that  100  subjects  enter  the  study  and  none  die  or  are  lost  before 
day  12. 
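
The arithmetic in Table 17.1 can be reproduced with a cumulative product of conditional survival proportions. The sketch below enters the table's counts directly; the column names are chosen here for illustration.

```r
# Counts from Table 17.1
tab <- data.frame(day      = c(12, 30, 60, 72),
                  at_risk  = c(100, 99, 96, 93),
                  deaths   = c(1, 2, 0, 3),
                  censored = c(0, 1, 3, 0))

# Conditional probability of surviving each listed day, given at risk that day
tab$cond_surv <- (tab$at_risk - tab$deaths) / tab$at_risk

# Cumulative survival is the running product of the conditional probabilities
tab$cum_surv <- cumprod(tab$cond_surv)

# Number at risk on the next row = at risk - deaths - censored on this row
next_at_risk <- tab$at_risk - tab$deaths - tab$censored

print(transform(tab, cum_surv = round(cum_surv, 2)))
```

Censored subjects leave the risk set without changing the conditional survival proportion on the day they are censored.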

Times in a life table should be measured as precisely as possible. If the
event being analyzed is death, the failure time should usually be specified
to the nearest day. We assume that deaths occur on the day indicated and
that being censored on a certain day implies the subject survived through the
end of that day. The data used in computing Kaplan-Meier estimates consist
of $(Y_i, e_i)$, $i = 1, 2, \ldots, n$ using notation defined previously. Primary data
collected to derive $(Y_i, e_i)$ usually consist of entry date, event date (if subject
failed), and censoring date (if subject did not fail). Instead, the entry date,
date of event/censoring, and event/censoring indicator may be specified.

The Kaplan-Meier estimator is called the product-limit estimator because
it is the limiting case of actuarial survival estimates as the time periods
shrink so that an entry is made for each failure time. An entry need not
be in the table for censoring times (when no failures occur at that time) as
long as the number of subjects censored is subtracted from the next number
at risk. Kaplan-Meier estimates are preferred to actuarial estimates because
they provide more resolution and make fewer assumptions. In constructing
a yearly actuarial life table, for example, it is traditionally assumed that
subjects censored between two years were followed 0.5 years.

Table 17.2 Summaries used in Kaplan-Meier computations

i   t_i   n_i   d_i   (n_i - d_i)/n_i
1   1     7     1     6/7
2   3     6     2     4/6
3   9     2     1     1/2

The product-limit estimator is a nonparametric maximum likelihood
estimator [331, pp. 10-13]. The formula for the Kaplan-Meier product-limit
estimator of $S(t)$ is as follows. Let $k$ denote the number of unique event times in the
sample and let $t_1, t_2, \ldots, t_k$ denote these times (ordered for ease
of calculation). Let $d_i$ denote the number of failures at $t_i$ and $n_i$ be the
number of subjects at risk at time $t_i$; that is, $n_i$ = number of failure/censoring
times $\ge t_i$. The estimator is then

$$S_{\text{KM}}(t) = \prod_{i:\, t_i \le t} (1 - d_i/n_i). \tag{17.27}$$

The Kaplan-Meier estimator of $\Lambda(t)$ is $\Lambda_{\text{KM}}(t) = -\log S_{\text{KM}}(t)$. An estimate
of quantile $q$ of failure time is $S_{\text{KM}}^{-1}(1 - q)$, if follow-up is long enough so that
$S_{\text{KM}}(t)$ drops as low as $1 - q$. If the last subject followed failed so that $S_{\text{KM}}(t)$
drops to zero, the expected failure time can be estimated by computing the
area under the Kaplan-Meier curve.

To demonstrate computation of $S_{\text{KM}}(t)$, imagine a sample of failure times
given by

1  3  3  6+  8+  9  10+,

where + denotes a censored time. The quantities needed to compute $S_{\text{KM}}$ are
in Table 17.2. Thus

$$S_{\text{KM}}(t) = \begin{cases} 1, & 0 \le t < 1 \\ 6/7 = .86, & 1 \le t < 3 \\ (6/7)(4/6) = .57, & 3 \le t < 9 \\ (6/7)(4/6)(1/2) = .29, & 9 \le t < 10. \end{cases} \tag{17.28}$$

Note that the estimate of $S(t)$ is undefined for $t > 10$ since not all subjects
have failed by $t = 10$ but no follow-up extends beyond $t = 10$. A graph of the
Kaplan-Meier estimate is found in Figure 17.6.

require(rms)
# Failure/censoring times and event indicators (1 = failure, 0 = censored)
tt   <- c(1, 3, 3, 6, 8, 9, 10)
stat <- c(1, 1, 1, 0, 0, 1, 0)
S <- Surv(tt, stat)
# Kaplan-Meier estimate with confidence bands and numbers at risk
survplot(npsurv(S ~ 1), conf = "bands", n.risk = TRUE,
         xlab = expression(t))
# Overlay the Fleming-Harrington estimate as a dotted line
survplot(npsurv(S ~ 1, type = "fleming-harrington",
                conf.int = FALSE), add = TRUE, lty = 3)

Fig. 17.6 Kaplan-Meier product-limit estimator with 0.95 confidence bands. The
Altschuler-Nelson-Aalen-Fleming-Harrington estimator is depicted with the dotted
lines.
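
The product-limit computation summarized in Table 17.2 and Equation 17.28 can also be carried out by hand in base R, without any survival package. The sketch below uses the seven-subject sample from the text; the helper variable names are chosen here for illustration.

```r
# Sample from the text: 1 3 3 6+ 8+ 9 10+ (1 = failure, 0 = censored)
tt   <- c(1, 3, 3, 6, 8, 9, 10)
stat <- c(1, 1, 1, 0, 0, 1, 0)

# Unique event times, numbers at risk, and numbers of failures
t_i <- sort(unique(tt[stat == 1]))
n_i <- sapply(t_i, function(s) sum(tt >= s))
d_i <- sapply(t_i, function(s) sum(tt == s & stat == 1))

# Product-limit estimate evaluated at each unique event time
S_km <- cumprod(1 - d_i / n_i)

data.frame(t_i, n_i, d_i, cond = (n_i - d_i) / n_i, S_km = round(S_km, 2))
```

Censoring times at 6, 8, and 10 do not get rows of their own; they only reduce the number at risk for later event times.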


The variance of $S_{\text{KM}}(t)$ can be estimated using Greenwood's formula [331,
p. 14], and using normality of $S_{\text{KM}}(t)$ in large samples this variance can
be used to derive a confidence interval for $S(t)$. A better method is to
derive an asymmetric confidence interval for $S(t)$ based on a symmetric
interval for $\log \Lambda(t)$. This latter method ensures that a confidence limit does
not exceed one or fall below zero, and is more accurate since $\log \Lambda_{\text{KM}}(t)$ is
more normally distributed than $S_{\text{KM}}(t)$. Once a confidence interval, say $[a, b]$,
is determined for $\log \Lambda(t)$, the confidence interval for $S(t)$ is computed by
$[\exp\{-\exp(b)\}, \exp\{-\exp(a)\}]$. The formula for an estimate of the variance
of interest is [331, p. 15]:

$$\text{Var}\{\log \Lambda_{\text{KM}}(t)\} = \frac{\sum_{i:\, t_i \le t} d_i/[n_i(n_i - d_i)]}{\{\sum_{i:\, t_i \le t} \log[(n_i - d_i)/n_i]\}^2}. \tag{17.29}$$

Letting $s$ denote the square root of this variance estimate, an approximate
$1 - \alpha$ confidence interval for $\log \Lambda(t)$ is given by $\log \Lambda_{\text{KM}}(t) \pm zs$, where $z$ is
the $1 - \alpha/2$ standard normal critical value. After simplification, the confidence
interval for $S(t)$ becomes

$$S_{\text{KM}}(t)^{\exp(\pm zs)}. \tag{17.30}$$

Even though the $\log \Lambda$ basis for confidence limits has theoretical
advantages, on the log-log scale the estimate of $S(t)$ has the greatest instability
where much information is available: when $S(t)$ falls just below 1.0. For that
reason, the recommended default confidence limits are on the $\Lambda(t)$ scale using

$$\text{Var}\{\Lambda_{\text{KM}}(t)\} = \sum_{i:\, t_i \le t} \frac{d_i}{n_i(n_i - d_i)}. \tag{17.31}$$

Letting $s$ denote its square root, an approximate $1 - \alpha$ confidence interval for
$S(t)$ is given by

$$\exp(\pm zs)\,S_{\text{KM}}(t), \tag{17.32}$$

truncated to [0,1].
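
Both kinds of confidence limits can be sketched in base R for the small sample summarized in Table 17.2. The variance on the cumulative hazard scale is taken here to be the Greenwood-type sum that forms the numerator of Equation 17.29; the 0.95 level and the variable names are choices made for this illustration.

```r
# Summaries from Table 17.2
t_i <- c(1, 3, 9)
n_i <- c(7, 6, 2)
d_i <- c(1, 2, 1)

S_km <- cumprod(1 - d_i / n_i)   # Kaplan-Meier estimate at each event time
L_km <- -log(S_km)               # Kaplan-Meier cumulative hazard
z    <- qnorm(0.975)             # critical value for 0.95 limits

# Limits on the cumulative hazard scale (Eq. 17.32), truncated to [0, 1]
s_L   <- sqrt(cumsum(d_i / (n_i * (n_i - d_i))))
lower <- pmax(0, S_km * exp(-z * s_L))
upper <- pmin(1, S_km * exp( z * s_L))

# Limits on the log cumulative hazard scale (Eqs. 17.29 and 17.30).
# The denominator of Eq. 17.29 is (log S_km)^2 = L_km^2.
s_ll     <- s_L / L_km
lower_ll <- S_km^exp( z * s_ll)
upper_ll <- S_km^exp(-z * s_ll)

round(data.frame(t_i, S_km, lower, upper, lower_ll, upper_ll), 3)
```

The log cumulative hazard limits stay strictly inside (0, 1) by construction, whereas the cumulative hazard limits rely on truncation.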


17.5.2  Altschuler-Nelson  Estimator 


Altschuler [19], Nelson [472], Aalen, and Fleming and Harrington [196] proposed
estimators of $\Lambda(t)$ or of $S(t)$ based on an estimator of $\Lambda(t)$:

$$\begin{aligned} \hat{\Lambda}(t) &= \sum_{i:\, t_i \le t} \frac{d_i}{n_i} \\ S_{\Lambda}(t) &= \exp(-\hat{\Lambda}(t)). \end{aligned} \tag{17.33}$$

$S_{\Lambda}(t)$ has advantages over $S_{\text{KM}}(t)$. First, $\sum_{i=1}^{n} \hat{\Lambda}(Y_i) = \sum_{i=1}^{n} e_i$
[605, Appendix 3]. In other words, the estimator gives the correct expected
number of events. Second, there is a wealth of asymptotic theory based on
the Altschuler-Nelson estimator [196].

See Figure 17.6 for an example of the $S_{\Lambda}(t)$ estimator. This estimator has
the same variance as $S_{\text{KM}}(t)$ for large enough samples.
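
The first advantage can be checked directly on the seven-subject sample used earlier (1 3 3 6+ 8+ 9 10+). The sketch below computes the cumulative hazard estimator of Equation 17.33 in base R and evaluates it at each subject's observed time; the variable names are chosen for illustration.

```r
# Sample from the text: 1 3 3 6+ 8+ 9 10+ (1 = failure, 0 = censored)
tt   <- c(1, 3, 3, 6, 8, 9, 10)
stat <- c(1, 1, 1, 0, 0, 1, 0)

t_i <- sort(unique(tt[stat == 1]))
n_i <- sapply(t_i, function(s) sum(tt >= s))
d_i <- sapply(t_i, function(s) sum(tt == s & stat == 1))

# Altschuler-Nelson cumulative hazard and survival estimates (Eq. 17.33)
L_na <- cumsum(d_i / n_i)
S_na <- exp(-L_na)

# Kaplan-Meier estimate for comparison
S_km <- cumprod(1 - d_i / n_i)

# Cumulative hazard estimate evaluated at each subject's observed time Y_i
L_at_Y <- sapply(tt, function(y) sum((d_i / n_i)[t_i <= y]))

c(sum_L = sum(L_at_Y), n_events = sum(stat))
round(data.frame(t_i, S_km, S_na), 3)
```

Because exp(-x) is never smaller than 1 - x, `S_na` lies on or above `S_km` at every event time.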


17.6  Analysis  of  Multiple  Endpoints 

Clinical studies frequently assess multiple endpoints. A cancer clinical trial
may, for example, involve recurrence of disease and death, whereas a
cardiovascular trial may involve nonfatal myocardial infarction and death.
Endpoints may be combined, and the new event (e.g., time until infarction or
death) may be analyzed with any of the tools of survival analysis because only
the usual censoring mechanism is used. Sometimes the various endpoints may
need separate study, however, because they may have different risk factors.

When the multiple endpoints represent multiple causes of a terminating
event (e.g., death), Prentice et al. have developed standard methods for
analyzing cause-specific hazards [513]; [331, pp. 163-178]. Their methods allow
each cause of failure to be analyzed separately, censoring on the other causes.
They do not assume any mechanism for cause removal nor make any
assumptions regarding the interrelation among causes of failure. However, analyses
of competing events using data where some causes of failure are removed in
a different way from the original dataset will give rise to different inferences.

When the multiple endpoints represent a mixture of fatal and nonfatal
outcomes, the analysis may be more complex. The same is true when one
wishes to jointly study an event-time endpoint and a repeated measurement.

17.6.1 Competing Risks

When events are independent, each event may also be analyzed separately by
censoring on all other events as well as censoring on loss to follow-up. This will
yield an unbiased estimate of an easily interpreted cause-specific $\lambda(t)$ or $S(t)$
because censoring is non-informative [331, pp. 168-169]. One minus $S_{\text{KM}}(t)$
computed in this manner will correctly estimate the probability of failing from
the event in the absence of other events. Even when the competing events are
not independent, the cause-specific hazard model may lead to valid results,
but the resulting model does not allow one to estimate risks conditional on
removal of one or more causes of the event. See Kay [340] for a nice example
of competing risks analysis when a treatment reduces the risk of death from
one cause but increases the risk of death from another cause.

Larson  and  Dinse3'  have  an  interesting  approach  that  jointly  models  the 
time  until  (any)  failure  and  the  failure  type.  For  r  failure  types,  they  use 
an r-category polytomous logistic model to predict the probability of failing
from  each  cause.  They  assume  that  censoring  is  unrelated  to  cause  of  event. 


17.6.2  Competing  Dependent  Risks 

In many medical and epidemiologic studies one is interested in analyzing
multiple causes of death. If the goal is to estimate cause-specific failure
probabilities, treating subjects dying from extraneous causes as censored and
then computing the ordinary Kaplan-Meier estimate results in biased (high)
survival estimates [212,225]. If cause $m$ is of interest, the cause-specific hazard
function is defined as

$$\lambda_m(t) = \lim_{u \to 0} \frac{\Pr\{\text{fail from cause } m \text{ in } [t, t+u) \mid \text{alive at } t\}}{u}. \tag{17.34}$$

The cumulative incidence function or probability of failure from cause $m$ by
time $t$ is given by

$$F_m(t) = \int_0^t \lambda_m(u) S(u)\,du, \tag{17.35}$$

where $S(u)$ is the probability of surviving (ignoring cause of death), which
equals $\exp[-\int_0^u (\sum_m \lambda_m(x))\,dx]$ [212]; [444, Chapter 10]; [102,408]. As previously
mentioned, $1 - F_m(t) = \exp[-\int_0^t \lambda_m(u)\,du]$ only if failures due to other causes
are eliminated and if the cause-specific hazard of interest remains unchanged
in doing so [212].

Again letting $t_1, t_2, \ldots, t_k$ denote the unique ordered failure times, a
nonparametric estimate of $F_m(t)$ is given by

$$\hat{F}_m(t) = \sum_{i:\, t_i \le t} \frac{d_{mi}}{n_i} S_{\text{KM}}(t_{i-1}), \tag{17.36}$$

where $d_{mi}$ is the number of failures of type $m$ at time $t_i$, $n_i$ is the number
of subjects at risk of failure at $t_i$, and $S_{\text{KM}}(t_0)$ is taken to be 1.
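
A minimal base R sketch of a nonparametric cumulative incidence estimate follows. It assumes the common form in which each cause-specific increment is weighted by the all-cause Kaplan-Meier estimate just before the failure time; the data, with `cause` coded 0 for censored and 1 or 2 for the two failure types, are hypothetical.

```r
# Hypothetical data: cause = 0 (censored), 1 or 2 (failure type)
time  <- c(2, 3, 3, 5, 6, 8, 9, 10)
cause <- c(1, 2, 1, 0, 2, 1, 0, 2)

# Unique failure times (any cause), numbers at risk, all-cause failures
t_i   <- sort(unique(time[cause > 0]))
n_i   <- sapply(t_i, function(s) sum(time >= s))
d_all <- sapply(t_i, function(s) sum(time == s & cause > 0))

# All-cause Kaplan-Meier estimate and its value just before each failure time
S_km   <- cumprod(1 - d_all / n_i)
S_prev <- c(1, head(S_km, -1))

# Cumulative incidence for cause m, evaluated at each failure time
cif <- function(m) {
  d_m <- sapply(t_i, function(s) sum(time == s & cause == m))
  cumsum(d_m / n_i * S_prev)
}
F1 <- cif(1)
F2 <- cif(2)

round(data.frame(t_i, n_i, S_km, F1, F2), 3)
```

With this weighting the cause-specific cumulative incidences add up to one minus the all-cause Kaplan-Meier estimate at every failure time, which is a convenient internal check.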

Pepe and others [494,496,497] showed how to use a combination of
Kaplan-Meier estimators to derive an estimator of the probability of being free of
event 1 by time $t$ given event 2 has not occurred by time $t$ (see also [349]).
Let $T_1$ and $T_2$ denote, respectively, the times until events 1 and 2. Let $S_1(t)$
and $S_2(t)$ denote, respectively, the two survival functions. Let us suppose
that event 1 is not a terminating event (e.g., is not death) and that even
after event 1 subjects are followed to ascertain occurrences of event 2. The
probability that $T_1 > t$ given $T_2 > t$ is

$$\begin{aligned} \text{Prob}\{T_1 > t \mid T_2 > t\} &= \frac{\text{Prob}\{T_1 > t \text{ and } T_2 > t\}}{\text{Prob}\{T_2 > t\}} \\ &= \frac{S_{12}(t)}{S_2(t)}, \end{aligned} \tag{17.37}$$

where $S_{12}(t)$ is the survival function for $\min(T_1, T_2)$, the earlier of the two
events. Since $S_{12}(t)$ does not involve any informative censoring (assuming as
always that loss to follow-up is non-informative), $S_{12}$ may be estimated by
the Kaplan-Meier estimator $S_{\text{KM}12}$ (or by $S_{\Lambda}$). For the type of event 1 we
have discussed above, $S_2$ can also be estimated without bias by $S_{\text{KM}2}$. Thus
we estimate, for example, the probability that a subject still alive at time $t$
will be free of myocardial infarction as of time $t$ by $S_{\text{KM}12}/S_{\text{KM}2}$.

Another quantity that can easily be computed from ordinary survival
estimates is $S_2(t) - S_{12}(t) = [1 - S_{12}(t)] - [1 - S_2(t)]$, which is the probability
that event 1 occurs by time $t$ and that event 2 has not occurred by time $t$.
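
The ratio and difference estimates can be sketched with a small base R Kaplan-Meier helper. Everything in the block is hypothetical: the helper `km_at`, the eight subjects, and the evaluation time `t_eval`. Event 1 is a nonfatal event (for example, myocardial infarction) and event 2 is death; when event 1 is not observed, its time is censored at the end of follow-up for death.

```r
# Kaplan-Meier estimate of survival at a single time t (hypothetical helper)
km_at <- function(time, status, t) {
  t_i <- sort(unique(time[status == 1 & time <= t]))
  if (length(t_i) == 0) return(1)
  n_i <- sapply(t_i, function(s) sum(time >= s))
  d_i <- sapply(t_i, function(s) sum(time == s & status == 1))
  prod(1 - d_i / n_i)
}

# Hypothetical data: (y2, e2) time to death; (y1, e1) time to nonfatal event 1
y2 <- c(10, 4, 7, 12, 9, 6, 12, 3)
e2 <- c(0,  1, 1, 0,  1, 0, 0,  1)
y1 <- c(5,  4, 2, 12, 9, 6, 8,  1)
e1 <- c(1,  0, 1, 0,  0, 0, 1,  1)

# Earlier of the two events: an event if either event 1 or death was observed
y12 <- pmin(y1, y2)
e12 <- ifelse(e1 == 1, 1, e2)

t_eval <- 8
S12 <- km_at(y12, e12, t_eval)   # free of both events at t_eval
S2  <- km_at(y2,  e2,  t_eval)   # alive at t_eval

c(free_of_event1_given_alive = S12 / S2,     # Eq. 17.37
  event1_occurred_and_alive  = S2 - S12)
```

Both quantities come from ordinary Kaplan-Meier fits, so standard software can supply the two ingredients.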

The ratio estimate above is used to estimate the survival function for one
event given that another has not occurred. Another function of interest is
the crude survival function which is a marginal distribution; that is, it is the
probability that $T_1 > t$ whether or not event 2 occurs [362]:

$$\begin{aligned} S_c(t) &= 1 - F_1(t) \\ F_1(t) &= \text{Prob}\{T_1 \le t\}, \end{aligned} \tag{17.38}$$

where $F_1(t)$ is the crude incidence function defined previously. Note that the
$T_1 \le t$ implies that the occurrence of event 1 is part of the probability being
computed. If event 2 is a terminating event so that some subjects can never
suffer event 1, the crude survival function for $T_1$ will never drop to zero. The
crude survival function can be interpreted as the survival distribution of $W$
where $W = T_1$ if $T_1 < T_2$ and $W = \infty$ otherwise [362].

17.6.3 State Transitions and Multiple Types of Nonfatal Events

In many studies there is one final, absorbing state (death, all causes) and
multiple live states. The live states may represent different health states or phases
of a disease. For example, subjects may be completely free of cancer, have an
isolated tumor, metastasize to a distant organ, and die. Unlike this example,
the live states need not have a definite ordering. One may be interested in
estimating transition probabilities, for example, the probability $\pi_{ij}(t_1, t_2)$ that
an individual in state $i$ at time $t_1$ is in state $j$ after an additional time $t_2$.
Strauss and Shavelle [59] have developed an extended Kaplan-Meier estimator
for this situation. Let $S^{i}_{\text{KM}}(t \mid t_1)$ denote the ordinary Kaplan-Meier estimate
of the probability of not dying before time $t$ (ignoring distinctions between
multiple live states) for a cohort of subjects beginning follow-up at time $t_1$
in state $i$. This is an estimate of the probability of surviving an additional $t$
time units (in any live state) given that the subject was alive and in state $i$
at time $t_1$. Strauss and Shavelle's estimator is given by

$$\hat{\pi}_{ij}(t_1, t_2) = \frac{n_{ij}(t_1, t_2)}{n_i(t_1, t_2)}\, S^{i}_{\text{KM}}(t_2 \mid t_1), \tag{17.39}$$

where $n_i(t_1, t_2)$ is the number of subjects in live state $i$ at time $t_1$ who are
alive and uncensored $t_2$ time units later, and $n_{ij}(t_1, t_2)$ is the number of such
subjects in state $j$ $t_2$ time units beyond $t_1$.

17.6.4 Joint Analysis of Time and Severity of an Event

In some studies, an endpoint is given more weight if it occurs earlier or
if it is more severe clinically, or both. For example, the event of interest
may be myocardial infarction, which may be of any severity from minimal
damage to the left ventricle to a fatal infarction. Berridge and Whitehead [52]
have provided a promising model for the analysis of such endpoints. Their
method assumes that the severity of endpoints which do occur is measured
on an ordinal categorical scale and that severity is assessed at the time of
the event. Berridge and Whitehead's example was time until first headache,
with severity of headaches graded on an ordinal scale. They proposed a joint
hazard of an individual who responds with ordered category $j$:

$$\lambda_j(t) = \lambda(t)\pi_j(t), \tag{17.40}$$

where $\lambda(t)$ is the hazard for the failure time and $\pi_j(t)$ is the probability of an
individual having event severity $j$ given she fails at time $t$. Note that a shift
in the distribution of response severity is allowed as the time until the event
increases.


17.6.5  Analysis  of  Multiple  Events 

It is common to choose as an endpoint in a clinical trial an event that can
recur. Examples include myocardial infarction, gastric ulcer, pregnancy, and
infection. Using only the time until the first event can result in a loss of
statistical information and power.^a There are specialized multivariate survival
models (whose assumptions are extremely difficult to verify) for handling this
setup, but in many cases a simpler approach will be efficient.

The simpler approach involves modeling the marginal distribution of the
time until each event [407,495]. Here one forms one record per subject per event,
and the survival time is the time to the first event for the first record, or is
the time from the previous event to the next event for all later records. This
approach yields consistent estimates of distribution parameters as long as the
marginal distributions are correctly specified [655]. One can allow the number of
previous events to influence the hazard function of another event by modeling
this count as a covariable.
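
The record layout described above can be sketched in base R. The subjects, event times, and follow-up limits below are hypothetical, and so is the helper `make_records`. Adding a final censored record for the time between a subject's last event and the end of follow-up is an assumption of this sketch, not something the text prescribes.

```r
# Hypothetical recurrent-event data: event times per subject and end of follow-up
events <- list(a = c(3, 7), b = numeric(0), c = c(5))
fu_end <- c(a = 10, b = 8, c = 9)

# One record per event, plus one trailing censored record per subject (assumed)
make_records <- function(id, ev, end) {
  stops  <- c(ev, end)
  status <- c(rep(1, length(ev)), 0)
  data.frame(id     = id,
             gap    = diff(c(0, stops)),       # time since previous event (or entry)
             status = status,                  # 1 = event, 0 = censored
             n_prev = seq_along(stops) - 1)    # number of previous events
}

recs <- do.call(rbind, Map(make_records, names(events), events, fu_end))
rownames(recs) <- NULL
recs
```

The `n_prev` column is the count of previous events that could be modeled as a covariable, and `id` identifies the cluster to which records belong.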

The multiple events within subject are not independent, so variance
estimates must be corrected for intracluster correlation. The clustered sandwich
covariance matrix estimator described in Section 9.5 and in [407] will provide
consistent estimates of variances and covariances even if the events are
dependent. Lin [407] also discussed how this method can easily be used to model
multiple events of differing types.

^a An exception to this is the case in which once an event occurs for the first time, that
event is likely to recur multiple times for any patient. Then the latter occurrences are
redundant.


17.7  R  Functions 

The event.chart function of Lee et al. [394] will draw a variety of charts for
displaying raw survival time data, for both single and multiple events per
subject. Relationships with covariables can also be displayed. The event.history
function of Dubin et al. [166] draws an event history graph for right-censored
survival data, including time-dependent covariate status. These functions are
in the Hmisc package.

The analyses described in this chapter can be viewed as special cases of the
Cox proportional hazards model [132]. The programs for Cox model analyses
described in Section 20.13 can be used to obtain the results described here, as
long as there is at least one stratification factor in the model. There are,
however, several R functions that are pertinent to the homogeneous or stratified
case. The R function survfit, and its particular renditions of the print, plot,
lines, and points generic functions (all part of the survival package written
by Terry Therneau), will compute, print, and plot Kaplan-Meier and Nelson
survival estimates. Confidence intervals for $S(t)$ may be based on $S$, $\Lambda$, or
$\log \Lambda$. The rms package's front-end to the survival package's survfit function
is npsurv for "nonparametric survival". It and other functions described in
later chapters use Therneau's Surv function to combine the response variable
and event indicator into a single R "survival time" object. In its simplest form,
use Surv(y, event), where y is the failure/right-censoring time and event is
the event/censoring indicator, usually coded T/F, 0 = censored 1 = event, or
1 = censored 2 = event. If the event status variable has other coding (e.g., 3
means death), use Surv(y, s==3). To handle interval time-dependent
covariables, or to use Andersen and Gill's counting process formulation of the Cox
model, use the notation Surv(tstart, tstop, status). The counting process
notation allows subjects to enter and leave risk sets at random. For each
time interval for each subject, the interval is made up of tstart < t ≤ tstop.
For time-dependent stratification, there is an optional origin argument to
Surv that indicates the hazard shape time origin at the time of crossover
to a new stratum. A type argument is used to handle left- and
interval-censoring, especially for parametric survival models. Possible values of type
are "right", "left", "interval", "counting", "interval2", "mstate".

The  Surv  expression  will  usually  be  used  inside  another  function,  but  it  is 
fine  to  save  the  result  of  Surv  in  another  object  and  to  use  this  object  in  the 
particular  fitting  function. 

npsurv is invoked by the following, with default parameter settings
indicated.

require(rms)
units(y) <- "Month"
# Default is "Day" - used for axis labels, etc.
npsurv(Surv(y, event) ~ svar1 + svar2 + ..., data, subset,
       type = c("kaplan-meier", "fleming-harrington", "fh2"),
       error = c("greenwood", "tsiatis"), se.fit = TRUE,
       conf.int = .95,
       conf.type = c("log", "log-log", "plain", "none"), ...)

If there are no stratification variables (svar1, ...), omit them. To print a table
of estimates, use

f <- npsurv(...)
print(f)    # print brief summary of f
summary(f, times, censored = FALSE)    # in survival

For failure times stored in days, use

f <- npsurv(Surv(futime, event) ~ sex)
summary(f, seq(30, 180, by = 30))

to print monthly estimates.

There is a plot method to plot the object returned by survfit and npsurv.
This invokes plot.survfit.

Objects created by npsurv can be passed to the more comprehensive
plotting function survplot (here, actually survplot.npsurv) for other options that
include automatic curve labeling and showing the number of subjects at risk
at selected times. See Figure 17.6 for an example. Stratified estimates, with
four treatments distinguished by line type and curve labels, could be drawn
by

units(y) <- "Year"
f <- npsurv(Surv(y, stat) ~ treatment)
survplot(f, ylab = "Fraction Pain-Free")

The groupkm function in rms computes and optionally plots $S_{\text{KM}}(u)$ or $\log \Lambda_{\text{KM}}(u)$ (if
loglog=TRUE) for fixed $u$ with automatic stratification on a continuous
predictor x. As in cut2 (Section 6.2) you can specify the number of subjects per
interval (default is m=50), the number of quantile groups (g), or the actual
cutpoints (cuts). groupkm plots the survival or log-log survival estimate against
mean x in each x interval.

The  bootkm  function  in  the  Hmisc  package  bootstraps  Kaplan-Meier  sur¬ 
vival  estimates  or  Kaplan-Meier  estimates  of  quantiles  of  the  survival  time 
distribution.  It  is  easy  to  use  bootkm  to  compute,  for  example,  a  nonparametric 
confidence  interval  for  the  ratio  of  median  survival  times  for  two  groups. 

See  the  Web  site  for  a  list  of  functions  from  other  users  for  nonparametric 
estimation  of  S(t)  with  left-,  right-,  and  interval-censored  data.  The  adaptive 
linear spline log-hazard fitting function heft [361] is freely available.
17.8 Further Reading

Some excellent general references for survival analysis are [57,83,114,133,154,
197,282,308,331,350,382,392,444,484,574,604].  Govindarajulu  et  al.229  have 
a  nice  review  of  frailty  models  in  survival  analysis,  for  handling  clustered  time- 
to-event  data. 

See  Goldman,220  Bull  and  Spiegelhalter,83,  Lee  et  al.394,  and  Dubin  et  al.166 
for  ways  to  construct  descriptive  graphs  depicting  right-censored  data. 

Some useful references for left-truncation are [83,112,244,524]. Mandel435
carefully described the difference between censoring and truncation.

See  [384,  p.  164]  for  some  ideas  for  detecting  informative  censoring.  Bilker  and 
Wang54  discuss  right-truncation  and  contrast  it  with  right-censoring. 

Arjas29  has  applications  based  on  properties  of  the  cumulative  hazard  function. 
Kooperberg  et  al. 361,594  have  an  adaptive  method  for  fitting  hazard  functions 
using  linear  splines  in  the  log  hazard.  Binquet  et  al.56  studied  a  related  approach 
using  quadratic  splines.  Mudholkar  et  al.466  presented  a  generalized  Weibull 
model  allowing  for  a  variety  of  hazard  shapes. 

Hollander  et  al.299  provide  a  nonparametric  simultaneous  confidence  band  for 
S(t),  surprisingly  using  likelihood  ratio  methods.  Miller459  showed  that  if  the 
parametric  form  of  S(t)  is  known  to  be  Weibull  with  known  shape  parameter  (an 
unlikely  scenario),  the  Kaplan-Meier  estimator  is  very  inefficient  (i.e. ,  has  high 
variance)  when  compared  with  the  parametric  maximum  likelihood  estimator. 
See  [666]  for  a  discussion  of  how  the  efficiency  of  Kaplan-Meier  estimators  can 
be  improved  by  interpolation  as  opposed  to  piecewise  flat  step  functions.  That 
paper  also  discusses  a  variety  of  other  estimators,  some  of  which  are  significantly 
more  efficient  than  Kaplan-Meier. 

See [112,244,438,570,614,619] for methods of estimating $S$ or $\Lambda$ in the presence
of left-truncation. See Turnbull616 for nonparametric estimation of S(t) with
left-, right-, and interval-censoring, and Kooperberg and Clarkson360 for a
flexible  parametric  approach  to  modeling  that  allows  for  interval-censoring. 
Lindsey  and  Ryan413  have  a  nice  tutorial  on  the  analysis  of  interval-censored 
data. 

Hogan and Laird297,298 developed methods for dealing with mixtures of fatal
and nonfatal outcomes, including some ideas for handling outcome-related
dropouts  on  the  repeated  measurements.  See  also  Finkelstein  and  Schoenfeld.193 
The  30  April  1997  issue  of  Statistics  in  Medicine  (Vol.  16)  is  devoted  to  methods 
for  analyzing  multiple  endpoints  as  well  as  designing  multiple  endpoint  stud¬ 
ies.  The  papers  in  that  issue  are  invaluable,  as  is  Therneau  and  Hamilton606 
and  Therneau  and  Grambsch.604  Huang  and  Wang311  presented  a  joint  model 
for  recurrent  events  and  a  terminating  event,  addressing  such  issues  as  the  fre¬ 
quency  of  recurrent  events  by  the  time  of  the  terminating  event. 

See  Lunn  and  McNeil429  and  Marubini  and  Valsecchi  [444,  Chapter  10]  for 
practical  approaches  to  analyzing  competing  risks  using  ordinary  Cox  propor¬ 
tional  hazards  models.  A  nice  overview  of  competing  risks  with  comparisons  of 
various  approaches  is  found  in  Tai  et  al.599,  Geskus214,  and  Koller  et  al.358. 
Bryant  and  Dignam'8  developed  a  semiparametric  procedure  in  which  com¬ 
peting  risks  are  adjusted  for  nonparametrically  while  a  parametric  cumulative 
incidence  function  is  used  for  the  event  of  interest,  to  gain  precision.  Fine  and 
Gray192  developed  methods  for  analyzing  competing  risks  by  estimating  sub¬ 
distribution  functions.  Nishikawa  et  al.4'8  developed  some  novel  approaches  to 
competing  risk  analysis  involving  time  to  adverse  drug  events  competing  with 
time  to  withdrawal  from  therapy.  They  also  dealt  with  different  severities  of 
events  in  an  interesting  way.  Putter  et  al.51'  has  a  nice  tutorial  on  competing 
risks,  multi-state  models,  and  associated  R  software.  Fiocco  et  al.194  developed 
an approach to avoid the problems caused by having to estimate a large number
of regression coefficients in multi-state models. Ambrogi et al.22 provide
clinically  useful  estimates  from  competing  risks  analyses. 

Jiang,  Chappell,  and  Fine322  present  methods  for  estimating  the  distribution 
of  event  times  of  nonfatal  events  in  the  presence  of  terminating  events  such  as 
death. 

Shen  and  Thall568  have  developed  a  flexible  parametric  approach  to  multi-state 
survival  analysis. 

Lancar  et  al.372  developed  a  method  for  analyzing  repeated  events  of  varying 
severities. 

Lawless  and  Nadeau384  have  a  very  good  description  of  models  dealing  with 
recurrent  events.  They  use  the  notion  of  the  cumulative  mean  function ,  which 
is  the  expected  number  of  events  experienced  by  a  subject  by  a  certain  time. 
Lawless383  contrasts  this  approach  with  other  approaches.  See  Aalen  et  al.1 2 3 4 5 
for  a  nice  example  in  which  multivariate  failure  times  (time  to  failure  of  fill¬ 
ings  in  multiple  teeth  per  subject)  are  analyzed.  Francis  and  Fuller204  devel¬ 
oped  a  graphical  device  for  depicting  complex  event  history  data.  Therneau  and 
Hamilton606  have  very  informative  comparisons  of  various  methods  for  model¬ 
ing  multiple  events,  showing  the  importance  of  whether  the  analyst  starts  the 
clock  over  after  each  event.  Kelly  and  Lim343  have  another  very  useful  paper 
comparing  various  methods  for  analyzing  recurrent  events.  Wang  and  Chang650 
demonstrated  the  difficulty  of  using  Kaplan-Meier  estimates  for  recurrence  time 
data. 


17.9  Problems 

1.  Make  a  rough  drawing  of  a  hazard  function  from  birth  for  a  man  who  de¬ 
velops  significant  coronary  artery  disease  at  age  50  and  undergoes  coronary 
artery  bypass  surgery  at  age  55. 

2.  Define  in  words  the  relationship  between  the  hazard  function  and  the  sur¬ 
vival  function. 

3.  In  a  study  of  the  life  expectancy  of  light  bulbs  as  a  function  of  the  bulb’s 
wattage,  100  bulbs  of  various  wattage  ratings  were  tested  until  each  had 
failed.  What  is  wrong  with  using  the  product-moment  linear  correlation 
test  to  test  whether  wattage  is  associated  with  life  length  concerning  (a) 
distributional  assumptions  and  (b)  other  assumptions? 

4.  A  placebo-controlled  study  is  undertaken  to  ascertain  whether  a  new  drug 
decreases  mortality.  During  the  study,  some  subjects  are  withdrawn  be¬ 
cause  of  moderate  to  severe  side  effects.  Assessment  of  side  effects  and 
withdrawal  of  patients  is  done  on  a  blinded  basis.  What  statistical  tech¬ 
nique  can  be  used  to  obtain  an  unbiased  treatment  comparison  of  survival 
times?  State  at  least  one  efficacy  endpoint  that  can  be  analyzed  unbiasedly. 

5.  Consider  long-term  follow-up  of  patients  in  the  support  dataset.  What  pro¬ 
portion  of  the  patients  have  censored  survival  times?  Does  this  imply  that 
one  cannot  make  accurate  estimates  of  chances  of  survival?  Make  a  his¬ 
togram  or  empirical  distribution  function  estimate  of  the  censored  follow¬ 
up  times.  What  is  the  typical  follow-up  duration  for  a  patient  in  the  study 
who has survived so far? What is the typical survival time for patients who
have  died?  Taking  censoring  into  account,  what  is  the  median  survival  time 
from  the  Kaplan-Meier  estimate  of  the  overall  survival  function?  Estimate 
the  median  graphically  or  using  any  other  sensible  method. 

6.  Plot  Kaplan-Meier  survival  function  estimates  stratified  by  dzclass.  Esti¬ 
mate  the  median  survival  time  and  the  first  quartile  of  time  until  death 
for  each  of  the  four  disease  classes. 

7.  Repeat  Problem  6  except  for  tertiles  of  meanbp. 

8.  The  commonly  used  log-rank  test  for  comparing  survival  times  between 
groups  of  patients  is  a  special  case  of  the  test  of  association  between  the 
grouping  variable  and  survival  time  in  a  Cox  proportional  hazards  regres¬ 
sion  model.  Depending  on  how  one  handles  tied  failure  times,  the  log-rank 
$\chi^2$ statistic exactly equals the score $\chi^2$ statistic from the Cox model, and
the likelihood ratio and Wald $\chi^2$ test statistics are also appropriate. To
obtain global score or LR $\chi^2$ tests and P-values you can use a statement such as
the following, where cph is in the rms package. It is similar to the survival
package's coxph function.

    cph(Survobject ~ predictor)

Here Survobject is a survival time object created by the Surv function. Obtain
the log-rank (score) $\chi^2$ statistic, degrees of freedom, and P-value for
testing  for  differences  in  survival  time  between  levels  of  dzclass.  Interpret 
this  test,  referring  to  the  graph  you  produced  in  Problem  6  if  needed. 

9.  Do  preliminary  analyses  of  survival  time  using  the  Mayo  Clinic  primary  bil¬ 
iary  cirrhosis  dataset  described  in  Section  8.9.  Make  graphs  of  Altschuler- 
Nelson  or  Kaplan-Meier  survival  estimates  stratified  separately  by  a  few 
categorical  predictors  and  by  categorized  versions  of  one  or  two  continuous 
predictors.  Estimate  median  failure  time  for  the  various  strata.  You  may 
want  to  suppress  confidence  bands  when  showing  multiple  strata  on  one 
graph.  See  [361]  for  parametric  fits  to  the  survival  and  hazard  function  for 
this  dataset. 


Chapter  18 

Parametric  Survival  Models 


18.1  Homogeneous  Models  (No  Predictors) 

The  nonparametric  estimator  of  S(t)  is  a  very  good  descriptive  statistic  for 
displaying  survival  data.  For  many  purposes,  however,  one  may  want  to  make 
more  assumptions  to  allow  the  data  to  be  modeled  in  more  detail.  By  speci¬ 
fying  a  functional  form  for  S(t)  and  estimating  any  unknown  parameters  in 
this  function,  one  can 

1.  easily  compute  selected  quantiles  of  the  survival  distribution; 

2.  estimate  (usually  by  extrapolation)  the  expected  failure  time; 

3. derive a concise equation and smooth function for estimating $S(t)$, $\Lambda(t)$,
and $\lambda(t)$; and

4. estimate $S(t)$ more precisely than $S_{\mathrm{KM}}(t)$ or $S_{\Lambda}(t)$ if the parametric form
is correctly specified.


18.1.1  Specific  Models 

Parametric modeling requires choosing one or more distributions. The Weibull
and exponential distributions were discussed in Chapter 17. Other commonly
used survival distributions are obtained by transforming $T$ and using a
standard distribution. The log transformation is most commonly employed. The
log-normal distribution specifies that $\log(T)$ has a normal distribution with
mean $\mu$ and variance $\sigma^2$. Stated another way, $\log(T) \sim \mu + \sigma\epsilon$, where $\epsilon$
has a standard normal distribution. Then $S(t) = 1 - \Phi((\log(t) - \mu)/\sigma)$,
where $\Phi$ is the standard normal cumulative distribution function. The
log-logistic distribution is given by $S(t) = [1 + \exp((\log(t) - \mu)/\sigma)]^{-1}$. Here
$\log(T) \sim \mu + \sigma\epsilon$ where $\epsilon$ follows a logistic distribution $[1 + \exp(-u)]^{-1}$. The log
extreme value distribution is given by $S(t) = \exp[-\exp((\log(t) - \mu)/\sigma)]$, and
$\log(T) \sim \mu + \sigma\epsilon$, where $\epsilon \sim 1 - \exp[-\exp(u)]$.
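The three families above follow one pattern: the survival function is one minus the distribution function of $\epsilon$ evaluated at $z = (\log t - \mu)/\sigma$. The following base-R sketch illustrates that pattern; the function names and the values of mu and sigma are illustrative assumptions, not part of the text.

```r
# Survival functions for three log-location-scale families.
# z = (log(t) - mu) / sigma and S(t) = 1 - F(z), where F is the CDF of the error term.
surv_lognormal   <- function(t, mu, sigma) 1 - pnorm((log(t) - mu) / sigma)
surv_loglogistic <- function(t, mu, sigma) 1 - plogis((log(t) - mu) / sigma)
surv_logextreme  <- function(t, mu, sigma) exp(-exp((log(t) - mu) / sigma))

mu <- 2; sigma <- 0.5                 # illustrative values only
t_grid <- c(1, 5, exp(mu), 20)        # exp(mu) is where z = 0

s_ln <- surv_lognormal(t_grid, mu, sigma)
s_ll <- surv_loglogistic(t_grid, mu, sigma)
s_ev <- surv_logextreme(t_grid, mu, sigma)
print(round(cbind(t = t_grid, s_ln, s_ll, s_ev), 3))
```

At $t = \exp(\mu)$ the log-normal and log-logistic curves equal 0.5 because their error distributions are symmetric about zero, whereas the log extreme value curve equals $\exp(-1)$ there.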

The generalized gamma and generalized F distributions provide a richer
variety of distribution and hazard functions [127,128]. Spline hazard
models [286,287,361] are other excellent alternatives.


18.1.2  Estimation 

Maximum  likelihood  (ML)  estimation  is  used  to  estimate  the  unknown 
parameters  of  S(t).  The  general  method  presented  in  Chapter  9  must  be 
augmented,  however,  to  allow  for  censored  failure  times.  The  basic  idea  is  as 
follows. Again let $T$ be a random variable representing time until the event,
$T_i$ be the (possibly censored) failure time for the $i$th observation, and $Y_i$
denote the observed failure or censoring time $\min(T_i, C_i)$, where $C_i$ is the
censoring time. If $Y_i$ is uncensored, observation $i$ contributes a factor to the
likelihood equal to the density function for $T$ evaluated at $Y_i$, $f(Y_i)$. If $Y_i$
instead represents a censored time so that $T_i = Y_i^{+}$, it is only known that
$T_i$ exceeds $Y_i$. The contribution to the likelihood function is the probability
that $T_i > C_i$ (equal to $\mathrm{Prob}\{T_i > Y_i\}$). This probability is $S(Y_i)$. The joint
likelihood over all observations $i = 1, 2, \ldots, n$ is

$$L = \prod_{i:\,Y_i\ \mathrm{uncensored}} f(Y_i) \prod_{i:\,Y_i\ \mathrm{censored}} S(Y_i). \tag{18.1}$$

There  is  one  more  component  to  L:  the  distribution  of  censoring  times  if 
these  are  not  fixed  in  advance.  Recall  that  we  assume  that  censoring  is  non- 
informative,  that  is,  it  is  independent  of  the  risk  of  the  event.  This  inde¬ 
pendence  implies  that  the  likelihood  component  of  the  censoring  distribution 
simply  multiplies  L  and  that  the  censoring  distribution  contains  little  infor¬ 
mation  about  the  survival  distribution.  In  addition,  the  censoring  distribution 
may  be  very  difficult  to  specify.  For  these  reasons  we  can  maximize  L  sepa¬ 
rately  to  estimate  parameters  of  S(t)  and  ignore  the  censoring  distribution. 

Recalling that $f(t) = \lambda(t) S(t)$ and $\Lambda(t) = -\log S(t)$, the log likelihood
can be written as

$$\log L = \sum_{i:\,Y_i\ \mathrm{uncensored}} \log \lambda(Y_i) - \sum_{i=1}^{n} \Lambda(Y_i). \tag{18.2}$$

All  observations  then  contribute  an  amount  to  the  log  likelihood  equal  to  the 
negative  of  the  cumulative  hazard  evaluated  at  the  failure/censoring  time. 
In  addition,  uncensored  observations  contribute  an  amount  equal  to  the  log 
of the hazard function evaluated at the time of failure. Once $L$ or $\log L$
is specified, the general ML methods outlined earlier can be used without
change in most situations. The principal difference is that censored observations
contribute less information to the statistical inference than uncensored
observations. For distributions such as the log-normal that are written only
in terms of $S(t)$, it may be easier to write the likelihood in terms of $S(t)$
and $f(t)$.
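The two ways of writing the censored-data likelihood (densities for failures and survival probabilities for censored times, or log hazards minus cumulative hazards) can be checked against each other numerically. This is a minimal base-R sketch; the helper name censored_loglik, the toy data, and the rate 0.2 are assumptions made only for illustration.

```r
# Censored-data log likelihood in hazard form:
#   sum over uncensored observations of log hazard
#   minus sum over all observations of cumulative hazard.
# y: observed time; event: 1 = failure observed, 0 = censored.
censored_loglik <- function(y, event, hazard, cumhaz) {
  sum(log(hazard(y[event == 1]))) - sum(cumhaz(y))
}

# Toy data and an exponential model with an arbitrary rate
rate  <- 0.2
y     <- c(2, 5, 7, 4)
event <- c(1, 0, 1, 1)

ll_hazard <- censored_loglik(
  y, event,
  hazard = function(t) rep(rate, length(t)),
  cumhaz = function(t) rate * t
)

# Product form on the log scale: log density for failures,
# log survival probability for censored times
ll_product <- sum(dexp(y[event == 1], rate, log = TRUE)) +
  sum(pexp(y[event == 0], rate, lower.tail = FALSE, log.p = TRUE))

print(c(hazard_form = ll_hazard, product_form = ll_product))
```

Any other parametric family can be substituted by passing its hazard and cumulative hazard functions.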

As an example, we turn to the exponential distribution, for which $\log L$
has a simple form that can be maximized explicitly. Recall that for this
distribution $\lambda(t) = \lambda$ and $\Lambda(t) = \lambda t$. Therefore,

$$\log L = \sum_{i:\,Y_i\ \mathrm{uncensored}} \log \lambda - \sum_{i=1}^{n} \lambda Y_i. \tag{18.3}$$

Letting $n_u$ denote the number of uncensored event times,

$$\log L = n_u \log \lambda - \sum_{i=1}^{n} \lambda Y_i. \tag{18.4}$$

Letting $w$ denote the sum of all failure/censoring times ("person years of
exposure"):

$$w = \sum_{i=1}^{n} Y_i, \tag{18.5}$$

the derivatives of $\log L$ are given by

$$\begin{aligned}
\frac{d \log L}{d\lambda} &= n_u/\lambda - w \\
\frac{d^2 \log L}{d\lambda^2} &= -n_u/\lambda^2.
\end{aligned} \tag{18.6}$$

Equating the derivative of $\log L$ to zero implies that the MLE of $\lambda$ is

$$\hat{\lambda} = n_u/w \tag{18.7}$$

or the number of failures per person-years of exposure. By inserting the MLE
of $\lambda$ into the formula for the second derivative we obtain the observed
estimated information, $w^2/n_u$. The estimated variance of $\hat{\lambda}$ is thus $n_u/w^2$ and
the standard error is $n_u^{1/2}/w$. The precision of the estimate depends primarily
on $n_u$.

Recall that the expected life length $\mu$ is $1/\lambda$ for the exponential
distribution. The MLE of $\mu$ is $w/n_u$ and its estimated variance is $w^2/n_u^3$. The MLE
of $S(t)$, $\hat{S}(t)$, is $\exp(-\hat{\lambda} t)$, and the estimated variance of $\log(\hat{\Lambda}(t))$ is simply
$1/n_u$.

As an example, consider the sample listed previously,

    1  3  3  6+  8+  9  10+.

Here $n_u = 4$ and $w = 40$, so the MLE of $\lambda$ is 0.1 failure per person-period.
The estimated standard error is $2/40 = 0.05$. Estimated expected life length
is 10 units with a standard error of 5 units. Estimated median failure time is
$\log(2)/0.1 = 6.931$. The estimated survival function is $\exp(-0.1t)$, which at
$t = 1, 3, 9, 10$ yields 0.90, 0.74, 0.41, and 0.37, which can be compared to the
product limit estimates listed earlier (0.85, 0.57, 0.29, 0.29).
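The arithmetic of this example can be reproduced in a few lines of base R. The sketch below uses only the closed-form expressions given above; the variable names are illustrative.

```r
# Sample from the text: 1 3 3 6+ 8+ 9 10+  ("+" marks a censored time)
y     <- c(1, 3, 3, 6, 8, 9, 10)
event <- c(1, 1, 1, 0, 0, 1, 0)

n_u <- sum(event)                 # number of uncensored failure times
w   <- sum(y)                     # total exposure (sum of all times)

lambda_hat <- n_u / w             # MLE of the constant hazard
se_lambda  <- sqrt(n_u) / w       # standard error of lambda_hat
mu_hat     <- w / n_u             # estimated expected life length
se_mu      <- sqrt(w^2 / n_u^3)   # its standard error
median_hat <- log(2) / lambda_hat # estimated median failure time
surv_hat   <- exp(-lambda_hat * c(1, 3, 9, 10))

print(c(lambda = lambda_hat, se = se_lambda, mean = mu_hat,
        se_mean = se_mu, median = median_hat))
print(round(surv_hat, 2))
```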

Now consider the Weibull distribution. The log likelihood function is

$$\log L = \sum_{i:\,Y_i\ \mathrm{uncensored}} \log[\alpha\gamma Y_i^{\gamma-1}] - \sum_{i=1}^{n} \alpha Y_i^{\gamma}. \tag{18.8}$$

Although $\log L$ can be simplified somewhat, it cannot be solved explicitly for
$\alpha$ and $\gamma$. An iterative method such as the Newton-Raphson method is used
to compute the MLEs of $\alpha$ and $\gamma$. Once these estimates are obtained, the
estimated variance-covariance matrix and other derived quantities such as
$\hat{S}(t)$ can be obtained in the usual manner.

For the dataset used in the exponential fit, the Weibull fit follows.

$$\begin{aligned}
\hat{\alpha} &= 0.0728 \\
\hat{\gamma} &= 1.164 \\
\hat{S}(t) &= \exp(-0.0728\, t^{1.164}) \\
\hat{S}^{-1}(0.5) &= [(\log 2)/\hat{\alpha}]^{1/\hat{\gamma}} = 6.935 \quad \text{(estimated median)}
\end{aligned} \tag{18.9}$$

This fit is very close to the exponential fit since $\hat{\gamma}$ is near 1.0. Note that the
two medians are almost equal. The predicted survival probabilities for the
Weibull model for $t = 1, 3, 9, 10$ are, respectively, 0.93, 0.77, 0.39, 0.35.
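A Weibull fit of this kind can be approximated numerically with a general-purpose optimizer. The sketch below is not the Newton-Raphson algorithm mentioned above: it uses the quasi-Newton BFGS method in base R's optim, and it optimizes over $\log\alpha$ and $\log\gamma$ so that both parameters stay positive. Starting values and variable names are assumptions.

```r
# Sample from the text: 1 3 3 6+ 8+ 9 10+
y     <- c(1, 3, 3, 6, 8, 9, 10)
event <- c(1, 1, 1, 0, 0, 1, 0)

# Negative Weibull log likelihood with hazard alpha*gamma*t^(gamma-1)
# and cumulative hazard alpha*t^gamma; par = c(log(alpha), log(gamma)).
neg_loglik <- function(par) {
  alpha <- exp(par[1])
  gamma <- exp(par[2])
  log_hazard <- log(alpha) + log(gamma) + (gamma - 1) * log(y)
  cum_hazard <- alpha * y^gamma
  -(sum(log_hazard[event == 1]) - sum(cum_hazard))
}

# Start from the exponential fit (alpha = 0.1, gamma = 1)
fit <- optim(c(log(0.1), 0), neg_loglik, method = "BFGS")

alpha_hat  <- exp(fit$par[1])
gamma_hat  <- exp(fit$par[2])
median_hat <- (log(2) / alpha_hat)^(1 / gamma_hat)
surv_hat   <- exp(-alpha_hat * c(1, 3, 9, 10)^gamma_hat)

print(round(c(alpha = alpha_hat, gamma = gamma_hat, median = median_hat), 3))
print(round(surv_hat, 2))
```

The estimates should agree with the values reported in the text up to rounding and optimizer tolerance.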

Sometimes  a  formal  test  can  be  made  to  assess  the  fit  of  the  proposed 
parametric  survival  distribution.  For  the  data  just  analyzed,  a  formal  test  of 
exponentiality versus a Weibull alternative is obtained by testing $H_0: \gamma = 1$
in the Weibull model. A score test yielded $\chi^2 = 0.14$ with 1 d.f., $p = 0.7$,
showing  little  evidence  for  non-exponentiality  (note  that  the  sample  size  is 
too  small  for  this  test  to  have  any  power). 


18.1.3  Assessment  of  Model  Fit 

The fit of the hypothesized survival distribution can often be checked easily
using graphical methods. Nonparametric estimates of $S(t)$ and $\Lambda(t)$
are primary tools for this purpose. For example, the Weibull distribution
$S(t) = \exp(-\alpha t^{\gamma})$ can be rewritten by taking logarithms twice:

$$\log[-\log S(t)] = \log \Lambda(t) = \log \alpha + \gamma (\log t). \tag{18.10}$$

The fit of a Weibull model can be assessed by plotting $\log \hat{\Lambda}(t)$ versus $\log t$
and checking whether the curve is approximately linear. Also, the plotted
curve provides approximate estimates of $\alpha$ (the antilog of the intercept) and
$\gamma$ (the slope). Since an exponential distribution is a special case of a Weibull
distribution when $\gamma = 1$, exponentially distributed data will tend to have a
graph that is linear with a slope of 1.
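As a rough illustration of this graphical check, the sketch below computes a cumulative hazard estimate of the Altschuler-Nelson type by hand for the small sample used earlier and plots it on the log-log scale. It uses base R only; the variable names are illustrative, and with three distinct failure times the fitted line is a toy summary rather than a serious diagnostic.

```r
# Sample from the text: 1 3 3 6+ 8+ 9 10+
y     <- c(1, 3, 3, 6, 8, 9, 10)
event <- c(1, 1, 1, 0, 0, 1, 0)

# At each distinct failure time add
# (number of failures at that time) / (number still at risk just before it).
fail_times <- sort(unique(y[event == 1]))
d          <- sapply(fail_times, function(t) sum(y == t & event == 1))
at_risk    <- sapply(fail_times, function(t) sum(y >= t))
cum_haz    <- cumsum(d / at_risk)

# Weibull check: log cumulative hazard should be roughly linear in log time,
# with intercept near log(alpha) and slope near gamma.
line <- lm(log(cum_haz) ~ log(fail_times))

print(cbind(time = fail_times, at_risk, d, cum_haz))
print(coef(line))

plot(log(fail_times), log(cum_haz),
     xlab = "log t", ylab = "log cumulative hazard")
abline(line)
```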

For any assumed distribution $S(t)$, a graphical assessment of goodness of
fit can be made by plotting $S^{-1}[S_{\Lambda}(t)]$ or $S^{-1}[S_{\mathrm{KM}}(t)]$ against $t$ and checking
for linearity. For log distributions, $S$ specifies the distribution of $\log(T)$, so
we plot against $\log t$. For a log-normal distribution we thus plot $\Phi^{-1}[S_{\Lambda}(t)]$
against $\log t$, where $\Phi^{-1}$ is the inverse of the standard normal cumulative
distribution function. For a log-logistic distribution we plot $\mathrm{logit}[S_{\Lambda}(t)]$ versus
$\log t$. For an extreme value distribution we use log-log plots as with the
Weibull distribution. Parametric model fits can also be checked by plotting
the fitted $\hat{S}(t)$ and $S_{\Lambda}(t)$ against $t$ on the same graph.
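The transformations named above can be collected in one place. The following base-R sketch applies each one to an exact survival curve from the matching family (not to data) to confirm that the result is linear in $\log t$; in practice a nonparametric estimate of $S(t)$ would be passed instead. The list name linearize and the values of mu and sigma are assumptions.

```r
# Transformation of S(t) that is linear in log(t) when the family is correct
linearize <- list(
  lognormal   = function(s) qnorm(s),       # inverse standard normal CDF
  loglogistic = function(s) qlogis(s),      # logit
  weibull     = function(s) log(-log(s))    # log-log (extreme value)
)

# Exact survival curves with illustrative location and scale
mu <- 1; sigma <- 0.8
t_grid <- c(0.5, 1, 2, 4, 8)
z <- (log(t_grid) - mu) / sigma
s_true <- list(
  lognormal   = 1 - pnorm(z),
  loglogistic = 1 - plogis(z),
  weibull     = exp(-exp(z))
)

# Slope of each transformed curve against log(t)
slopes <- sapply(names(linearize), function(fam) {
  transformed <- linearize[[fam]](s_true[[fam]])
  unname(coef(lm(transformed ~ log(t_grid)))[2])
})
print(round(slopes, 3))
```

Each slope has magnitude $1/\sigma$; the sign is negative for the normal and logit transformations because they are applied to a survival probability rather than to a cumulative probability.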


18.2  Parametric  Proportional  Hazards  Models 

In this section we present one way to generalize the survival model to a
survival regression model. In other words, we allow the sample to be
heterogeneous by adding predictor variables $X = \{X_1, X_2, \ldots, X_k\}$. As with other
regression models, $X$ can represent a mixture of binary, polytomous,
continuous, spline-expanded, and even ordinal predictors (if the categories are scored
to satisfy the linearity assumption). Before discussing ways in which the
regression part of a survival model might be specified, first recall how regression
effects have been modeled in other settings. In multiple linear regression, the
regression effect $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$ can be thought of
as an increment in the expected value of the response $Y$. In binary logistic
regression, $X\beta$ specifies the log odds that $Y = 1$, or $\exp(X\beta)$ multiplies the
odds that $Y = 1$.


18.2.1  Model 

The most widely used survival regression specification is to allow the hazard
function $\lambda(t)$ to be multiplied by $\exp(X\beta)$. The survival model is thus
generalized from a hazard function $\lambda(t)$ for the failure time $T$ to a hazard function
$\lambda(t)\exp(X\beta)$ for the failure time given the predictors $X$:

$$\lambda(t|X) = \lambda(t) \exp(X\beta). \tag{18.11}$$

This regression formulation is called the proportional hazards (PH) model.
The $\lambda(t)$ part of $\lambda(t|X)$ is sometimes called an underlying hazard function or
a hazard function for a standard subject, which is a subject with $X\beta = 0$. Any
parametric hazard function can be used for $\lambda(t)$, and as we show later, $\lambda(t)$
can be left completely unspecified without sacrificing the ability to estimate
$\beta$, by the use of Cox's semi-parametric PH model. Depending on whether
the underlying hazard function $\lambda(t)$ has a constant scale parameter, $X\beta$ may
or may not include an intercept $\beta_0$. The term $\exp(X\beta)$ can be called a relative
hazard function and in many cases it is the function of primary interest as it
describes the (relative) effects of the predictors.

The PH model can also be written in terms of the cumulative hazard and
survival functions:

$$\begin{aligned}
\Lambda(t|X) &= \Lambda(t) \exp(X\beta) \\
S(t|X) &= \exp[-\Lambda(t) \exp(X\beta)] = \exp[-\Lambda(t)]^{\exp(X\beta)}.
\end{aligned} \tag{18.12}$$

$\Lambda(t)$ is an "underlying" cumulative hazard function. $S(t|X)$, the probability
of surviving past time $t$ given the values of the predictors $X$, can also be
written as

$$S(t|X) = S(t)^{\exp(X\beta)}, \tag{18.13}$$

where $S(t)$ is the "underlying" survival distribution, $\exp(-\Lambda(t))$. The effect
of the predictors is to multiply the hazard and cumulative hazard functions
by a factor $\exp(X\beta)$, or equivalently to raise the survival function to a power
equal to $\exp(X\beta)$.
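The equivalence between multiplying the cumulative hazard by $\exp(X\beta)$ and raising the survival function to the power $\exp(X\beta)$ is easy to confirm numerically. This base-R sketch assumes a Weibull form for the underlying cumulative hazard; the parameter values and the linear predictor are made up for illustration.

```r
# Underlying (standard-subject) cumulative hazard and survival function
alpha <- 0.05; gamma <- 1.2            # illustrative Weibull parameters
cum_haz_0 <- function(t) alpha * t^gamma
surv_0    <- function(t) exp(-cum_haz_0(t))

t  <- c(1, 5, 10)
xb <- log(2)                           # linear predictor giving hazard ratio 2

# Form 1: multiply the cumulative hazard by exp(X beta)
s_via_hazard <- exp(-cum_haz_0(t) * exp(xb))
# Form 2: raise the underlying survival function to the power exp(X beta)
s_via_power  <- surv_0(t)^exp(xb)

print(cbind(t, underlying = surv_0(t), s_via_hazard, s_via_power))
```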


18.2.2  Model  Assumptions  and  Interpretation 
of  Parameters 

In the general regression notation of Section 2.2, the log hazard or log
cumulative hazard can be used as the property of the response $T$ evaluated at time
$t$ that allows distributional and regression parts to be isolated and checked.
The PH model can be linearized with respect to $X\beta$ using the following
identities.

$$\begin{aligned}
\log \lambda(t|X) &= \log \lambda(t) + X\beta \\
\log \Lambda(t|X) &= \log \Lambda(t) + X\beta.
\end{aligned} \tag{18.14}$$

No matter which of the three model statements is used, there are certain
assumptions in a parametric PH survival model. These assumptions are listed
below.

1. The true form of the underlying functions ($\lambda$, $\Lambda$, and $S$) should be specified
correctly.

2. The relationship between the predictors and log hazard or log cumulative
hazard should be linear in its simplest form. In the absence of interaction
terms, the predictors should also operate additively.

3. The way in which the predictors affect the distribution of the response
should be by multiplying the hazard or cumulative hazard by $\exp(X\beta)$
or equivalently by adding $X\beta$ to the log hazard or log cumulative hazard
at each $t$. The effect of the predictors is assumed to be the same at all
values of $t$ since $\log \lambda(t)$ can be separated from $X\beta$. In other words, the
PH assumption implies no $t$ by predictor interaction.

The regression coefficient for $X_j$, $\beta_j$, is the increase in log hazard or log
cumulative hazard at any fixed point in time if $X_j$ is increased by one unit
and all other predictors are held constant. This can be written formally as

$$\beta_j = \log \lambda(t|X_1, X_2, \ldots, X_j + 1, X_{j+1}, \ldots, X_k) - \log \lambda(t|X_1, \ldots, X_j, \ldots, X_k), \tag{18.15}$$

which is equivalent to the log of the ratio of the hazards at time $t$. The
regression coefficient can just as easily be written in terms of a ratio of hazards
at time $t$. The ratio of hazards at $X_j + d$ versus $X_j$, all other factors held
constant, is $\exp(\beta_j d)$. Thus the effect of increasing $X_j$ by $d$ is to increase the
hazard of the event by a factor of $\exp(\beta_j d)$ at all points in time, assuming $X_j$
is linearly related to $\log \lambda(t)$. In general, the ratio of hazards for an individual
with predictor variable values $X^*$ compared to an individual with predictors
$X$ is

$$\begin{aligned}
X^* : X \text{ hazard ratio} &= [\lambda(t) \exp(X^*\beta)]/[\lambda(t) \exp(X\beta)] \\
&= \exp(X^*\beta)/\exp(X\beta) = \exp[(X^* - X)\beta].
\end{aligned} \tag{18.16}$$
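The cancellation of the underlying hazard in this ratio can be illustrated with a short base-R sketch. The coefficients, the two covariate profiles, and the underlying hazard function below are all hypothetical values chosen only to show the computation.

```r
# Hypothetical log hazard ratios: age per year, and female versus male
beta   <- c(age = 0.03, female = -0.4)
x      <- c(age = 50, female = 0)      # reference individual X
x_star <- c(age = 60, female = 1)      # comparison individual X*

# Hazard ratio from the linear predictors alone: exp[(X* - X) beta]
hazard_ratio <- exp(sum((x_star - x) * beta))

# Same ratio from the hazards themselves at an arbitrary time t,
# using an arbitrary underlying hazard; lambda0(t) cancels.
lambda0 <- function(t) 0.1 * sqrt(t)
t <- 3
ratio_direct <- (lambda0(t) * exp(sum(x_star * beta))) /
                (lambda0(t) * exp(sum(x * beta)))

print(c(hazard_ratio = hazard_ratio, ratio_direct = ratio_direct))
```

Changing t or lambda0 leaves both numbers unchanged, which is the PH assumption stated in computational terms.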

If there is only one predictor $X_1$ and that predictor is binary, the PH model
can be written

$$\begin{aligned}
\lambda(t|X_1 = 0) &= \lambda(t) \\
\lambda(t|X_1 = 1) &= \lambda(t) \exp(\beta_1).
\end{aligned} \tag{18.17}$$

Here $\exp(\beta_1)$ is the $X_1 = 1 : X_1 = 0$ hazard ratio. This simple case has
no regression assumption but assumes PH and a form for $\lambda(t)$. If the single
predictor $X_1$ is continuous, the model becomes

$$\lambda(t|X_1) = \lambda(t) \exp(\beta_1 X_1). \tag{18.18}$$

Without further modification (such as taking a transformation of the
predictor), the model assumes a straight line in the log hazard or that for all $t$, an
increase in $X_1$ by one unit increases the hazard by a factor of $\exp(\beta_1)$.

As in logistic regression, much more general regression specifications can
be made, including interaction effects. Unlike logistic regression, however, a
model containing, say age, sex, and age x sex interaction is not equivalent to
fitting two separate models. This is because even though males and females
are allowed to have unequal age slopes, both sexes are assumed to have the
underlying hazard function proportional to $\lambda(t)$ (i.e., the PH assumption
holds for sex in addition to age).

Table 18.1  Mortality differences and ratios when hazard ratio is 0.5

    Subject    5-Year Survival    Difference    Mortality Ratio (T/C)
                 C        T
       1        0.98     0.99        0.01        0.01/0.02 = 0.5
       2        0.80     0.89        0.09        0.11/0.2  = 0.55
       3        0.25     0.50        0.25        0.5/0.75  = 0.67


18.2.3  Hazard Ratio, Risk Ratio, and Risk Difference

Other  ways  of  modeling  predictors  can  also  be  specified  besides  a  multiplica¬ 
tive  effect  on  the  hazard.  For  example,  one  could  postulate  that  the  effect  of 
a  predictor  is  to  add  to  the  hazard  of  failure  instead  of  to  multiply  it  by  a 
factor.  The  effect  of  a  predictor  could  also  be  described  in  terms  of  a  mor¬ 
tality  ratio  (relative  risk),  risk  difference,  odds  ratio,  or  increase  in  expected 
failure  time.  However,  just  as  an  odds  ratio  is  a  natural  way  to  describe  an 
effect  on  a  binary  response,  a  hazard  ratio  is  often  a  natural  way  to  describe 
an  effect  on  survival  time.  One  reason  is  that  a  hazard  ratio  can  be  constant. 

Table  18.1  provides  treated  (T)  to  control  (C)  survival  (mortality)  dif¬ 
ferences  and  mortality  ratios  for  three  hypothetical  types  of  subjects.  We 
suppose  that  subjects  1,2,  and  3  have  increasingly  worse  prognostic  factors. 
For  example,  the  age  at  baseline  of  the  subjects  might  be  30,  50,  and  70  years, 
respectively.  We  assume  that  the  treatment  affects  the  hazard  by  a  constant 
multiple of 0.5 (i.e., PH is in effect and the constant hazard ratio is 0.5). Note
that $S_T = S_C^{0.5}$. Notice that the mortality difference and ratio depend on the
survival  of  the  control  subject.  A  control  subject  having  “good”  predictor 
values  will  leave  little  room  for  an  improved  prognosis  from  the  treatment. 
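The entries of Table 18.1 follow from the control survival probabilities and the hazard ratio alone. The base-R sketch below reproduces them; it rounds treated survival to two decimals before forming the ratio, which is an assumption made to match the rounding apparent in the table.

```r
hazard_ratio <- 0.5
s_control <- c(0.98, 0.80, 0.25)                  # 5-year survival, control (C)

# Under PH, treated survival is control survival raised to the hazard ratio
s_treated <- round(s_control^hazard_ratio, 2)     # rounded as in the table

difference      <- s_treated - s_control
mortality_ratio <- (1 - s_treated) / (1 - s_control)

table_18_1 <- data.frame(
  subject         = 1:3,
  control         = s_control,
  treated         = s_treated,
  difference      = round(difference, 2),
  mortality_ratio = round(mortality_ratio, 2)
)
print(table_18_1)
```

The hazard ratio is the same for all three subjects, yet the mortality ratio moves from 0.5 toward 0.67 and the absolute difference grows as control survival falls.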

The  hazard  ratio  is  a  basis  for  describing  the  mechanism  of  an  effect.  In  the 
above  example,  it  is  reasonable  that  the  treatment  affect  each  subject  by  low¬ 
ering  her  hazard  of  death  by  a  factor  of  2,  even  though  less  sick  subjects  have 
a  low  mortality  difference.  Hazard  ratios  also  lead  to  good  statistical  tests 
for  differences  in  survival  patterns  and  to  predictive  models.  Once  the  model 
is  developed,  however,  survival  differences  may  better  capture  the  impact  of 
a  risk  factor.  Absolute  survival  differences  rather  than  relative  differences 
(hazard ratios) also relate more closely to statistical power. For example,
even  if  the  effect  of  a  treatment  is  to  halve  the  hazard  rate,  a  population 
where  the  control  survival  is  0.99  will  require  a  much  larger  sample  than  will 
a  population  where  the  control  survival  is  0.3. 

Figure 18.1 depicts the relationship between survival $S(t)$ of a control
subject at any time $t$, relative reduction in hazard ($h$), and difference in
survival $S(t) - S(t)^h$. This figure demonstrates that absolute clinical benefit
is primarily a function of the baseline risk of a subject. Clinical benefit will
also be a function of factors that interact with treatment, that is, factors
that modify the relative benefit of treatment. Once a model is developed
for estimating $S(t|X)$, this model can be used to estimate absolute benefit
as a function of baseline risk factors as well as factors that interact with a
treatment. Let $X_1$ be a binary treatment indicator and let $A = \{X_2, \ldots, X_p\}$
be the other factors (which for convenience we assume do not interact with
$X_1$). Then the estimate of $S(t|X_1 = 0, A) - S(t|X_1 = 1, A)$ can be plotted
against $S(t|X_1 = 0)$ or against levels of variables in $A$ to display absolute
benefit versus overall risk or specific subject characteristics.

[Figure 18.1; horizontal axis: Survival for Control Subject]

Fig. 18.1  Absolute clinical benefit as a function of survival in a control subject and
the relative benefit (hazard ratio). The hazard ratios are given for each curve.


18.2.4 Specific Models

Let $X\beta$ denote the linear combination of predictors excluding an intercept term. Using the PH formulation, an exponential survival regression model can be stated as

$$\begin{aligned}
\lambda(t|X) &= \lambda \exp(X\beta) \\
S(t|X) &= \exp[-\lambda t \exp(X\beta)] = \exp(-\lambda t)^{\exp(X\beta)}.
\end{aligned} \tag{18.19}$$

The parameter $\lambda$ can be thought of as the antilog of an intercept term since the model could be written $\lambda(t|X) = \exp[(\log \lambda) + X\beta]$. The effect of $X$ on the expected or median failure time is as follows.

$$\begin{aligned}
E\{T|X\} &= 1/[\lambda \exp(X\beta)] \\
T_{0.5}|X &= (\log 2)/[\lambda \exp(X\beta)].
\end{aligned} \tag{18.20}$$

The exponential regression model can be written in another form that is more numerically stable by replacing the $\lambda$ parameter with an intercept term in $X\beta$, specifically $\lambda = \exp(\beta_0)$. After redefining $X\beta$ to include $\beta_0$, $\lambda$ can be dropped in all the above formulas.

The Weibull regression model is defined by one of the following functions (assuming that $X\beta$ does not contain an intercept).

$$\begin{aligned}
\lambda(t|X) &= \alpha\gamma t^{\gamma-1} \exp(X\beta) \\
\Lambda(t|X) &= \alpha t^{\gamma} \exp(X\beta) \\
S(t|X) &= \exp[-\alpha t^{\gamma} \exp(X\beta)] = [\exp(-\alpha t^{\gamma})]^{\exp(X\beta)}.
\end{aligned} \tag{18.21}$$

Note that the parameter $\alpha$ in the homogeneous Weibull model has been replaced with $\alpha\exp(X\beta)$. The median survival time is given by

$$T_{0.5}|X = \{\log 2/[\alpha\exp(X\beta)]\}^{1/\gamma}. \tag{18.22}$$

As with the exponential model, the parameter $\alpha$ could be dropped (and replaced with $\exp(\beta_0)$) if an intercept $\beta_0$ is added to $X\beta$.

For numerical reasons it is sometimes advantageous to write the Weibull PH model as

$$S(t|X) = \exp(-\Lambda(t|X)), \tag{18.23}$$

where

$$\Lambda(t|X) = \exp(\gamma \log t + X\beta). \tag{18.24}$$
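The Weibull PH formulas above can be checked numerically. The following R sketch uses made-up parameter values (`alpha`, `gam`, `beta` are illustrative assumptions, not estimates from any dataset) to confirm that survival at the median of Equation 18.22 is 0.5 and that the form in Equation 18.24 agrees with Equation 18.21 once $\log\alpha$ is absorbed as an intercept.

```r
# Weibull PH model, Equations 18.21-18.24 (parameter values are illustrative only)
alpha <- 0.02; gam <- 1.5     # hypothetical baseline hazard parameters
beta  <- log(2)               # hypothetical log hazard ratio for a binary X

cumhaz      <- function(t, x) alpha * t^gam * exp(x * beta)            # Lambda(t | X)
surv        <- function(t, x) exp(-cumhaz(t, x))                       # S(t | X)
median_time <- function(x) (log(2) / (alpha * exp(x * beta)))^(1 / gam)

# Form (18.24), with log(alpha) playing the role of the intercept
cumhaz2 <- function(t, x) exp(gam * log(t) + log(alpha) + x * beta)

m0 <- median_time(0); m1 <- median_time(1)
print(c(median_x0 = m0, median_x1 = m1, S_at_median_x1 = surv(m1, 1)))
print(all.equal(cumhaz(3, 1), cumhaz2(3, 1)))
```

With a hazard ratio of 2, the median for `x = 1` is shorter than that for `x = 0` by the factor $2^{-1/\gamma}$, not by a factor of 2, which is one reason hazard ratios and time ratios should not be confused.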


18.2.5  Estimation 

The parameters in $\lambda$ and $\beta$ are estimated by maximizing a log likelihood function constructed in the same manner as described in Section 18.1. The only difference is the insertion of $\exp(X_i\beta)$ in the likelihood function:

$$\log L = \sum_{i:\,Y_i\ \text{uncensored}} \log[\lambda(Y_i)\exp(X_i\beta)] - \sum_{i=1}^{n} \Lambda(Y_i)\exp(X_i\beta). \tag{18.25}$$

Once $\hat\beta$, the MLE of $\beta$, is computed along with the large-sample standard error estimates, hazard ratio estimates and their confidence intervals can readily be computed. Letting $s$ denote the estimated standard error of $\hat\beta_j$, a $1 - \alpha$ confidence interval for the $X_j + 1 : X_j$ hazard ratio is given by $\exp[\hat\beta_j \pm zs]$, where $z$ is the $1 - \alpha/2$ critical value for the standard normal distribution.
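To make the likelihood in Equation 18.25 concrete, here is a minimal base-R sketch for the exponential model, for which $\lambda(t) = \exp(\beta_0)$ and $\Lambda(t) = \exp(\beta_0)t$. The data are simulated, and all object names (`negloglik`, `fit`, and so on) are our own; in practice one would use a fitting function such as `psm` rather than `optim`.

```r
set.seed(1)
n  <- 400
x  <- rbinom(n, 1, 0.5)                     # binary predictor
tt <- rexp(n, rate = 0.1 * exp(-0.7 * x))   # simulated failure times
cc <- runif(n, 0, 30)                       # simulated censoring times
y  <- pmin(tt, cc)                          # observed follow-up time
d  <- as.numeric(tt <= cc)                  # 1 = uncensored

# Negative of log L in Equation 18.25 with lambda(t) = exp(b0), Lambda(t) = exp(b0) * t
negloglik <- function(b) {
  lp <- b[1] + b[2] * x
  -(sum(d * lp) - sum(y * exp(lp)))
}
start <- c(log(sum(d) / sum(y)), 0)         # overall event rate as starting value
fit   <- optim(start, negloglik, method = "BFGS", hessian = TRUE)
bhat  <- fit$par
se    <- sqrt(diag(solve(fit$hessian)))     # large-sample standard errors
hr    <- exp(bhat[2])
ci    <- exp(bhat[2] + c(-1, 1) * qnorm(0.975) * se[2])

# Closed-form check: the MLE of each group's rate is events / total exposure
rate <- tapply(d, x, sum) / tapply(y, x, sum)
print(c(HR = hr, lower = ci[1], upper = ci[2],
        HR_closed_form = unname(rate["1"] / rate["0"])))
```

For a single binary predictor the numerical maximizer should reproduce the ratio of the two group-specific event rates, which gives a simple check on the likelihood code.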

Once the parameters of the underlying hazard function are estimated, the MLE of $\lambda(t)$, $\hat\lambda(t)$, can be derived. The MLE of $\lambda(t|X)$, the hazard as a function of $t$ and $X$, is given by

$$\hat\lambda(t|X) = \hat\lambda(t)\exp(X\hat\beta). \tag{18.26}$$

The MLE of $\Lambda(t)$, $\hat\Lambda(t)$, can be derived from the integral of $\hat\lambda(t)$ with respect to $t$. Then the MLE of $S(t|X)$ can be derived:

$$\hat S(t|X) = \exp[-\hat\Lambda(t)\exp(X\hat\beta)]. \tag{18.27}$$

For the Weibull model, we denote the MLEs of the hazard parameters $\alpha$ and $\gamma$ by $\hat\alpha$ and $\hat\gamma$. The MLEs of $\lambda(t|X)$, $\Lambda(t|X)$, and $S(t|X)$ for this model are

$$\begin{aligned}
\hat\lambda(t|X) &= \hat\alpha\hat\gamma t^{\hat\gamma-1}\exp(X\hat\beta) \\
\hat\Lambda(t|X) &= \hat\alpha t^{\hat\gamma}\exp(X\hat\beta) \\
\hat S(t|X) &= \exp[-\hat\Lambda(t|X)].
\end{aligned} \tag{18.28}$$

Confidence intervals for $S(t|X)$ are best derived using general matrix notation to obtain an estimate $s$ of the standard error of $\log[\hat\Lambda(t|X)]$ from the estimated information matrix of all hazard and regression parameters. A confidence interval for $\hat S$ will be of the form

$$\hat S(t|X)^{\exp(\pm zs)}. \tag{18.29}$$


The MLEs of $\beta$ and of the hazard shape parameters lead directly to MLEs of the expected and median life length. For the Weibull model the MLE of the median life length given $X$ is

$$\hat T_{0.5}|X = \{\log 2/[\hat\alpha\exp(X\hat\beta)]\}^{1/\hat\gamma}. \tag{18.30}$$

For the exponential model, the MLE of the expected life length for a subject having predictor values $X$ is given by

$$\hat E(T|X) = [\hat\lambda\exp(X\hat\beta)]^{-1}, \tag{18.31}$$

where $\hat\lambda$ is the MLE of $\lambda$.

Fig. 18.2 PH model with one binary predictor. Y-axis is $\log\lambda(t)$ or $\log\Lambda(t)$. For $\log\Lambda(t)$, the curves must be non-decreasing. For $\log\lambda(t)$, they may be any shape.


18.2.6  Assessment  of  Model  Fit 

Three assumptions of the parametric PH model were listed in Section 18.2.2. We now lay out in more detail what relationships need to be satisfied. We first assume a PH model with a single binary predictor $X_1$. For a general underlying hazard function $\lambda(t)$, all assumptions of the model are displayed in Figure 18.2. In this case, the assumptions are PH and a shape for $\lambda(t)$.

If $\lambda(t)$ is Weibull, the two curves will be linear if $\log t$ is plotted instead of $t$ on the x-axis. Note also that if there is no association between $X$ and survival ($\beta_1 = 0$), estimates of the two curves will be close and will intertwine due to random variability. In this case, PH is not an issue.

If the single predictor is continuous, the relationships in Figures 18.3 and 18.4 must hold. Here linearity is assumed (unless otherwise specified) besides PH and the form of $\lambda(t)$. In Figure 18.3, the curves must be parallel for any choices of times $t_1$ and $t_2$, and each individual curve must be linear. Also, the difference between ordinates needs to conform to the assumed distribution. This difference is $\log[\lambda(t_2)/\lambda(t_1)]$ or $\log[\Lambda(t_2)/\Lambda(t_1)]$.

Figure 18.4 highlights the PH assumption. The relationship between the two curves must hold for any two values $c$ and $d$ of $X_1$. The shape of the function for a given value of $X_1$ must conform to the assumed $\lambda(t)$. For a Weibull model, the functions should each be linear in $\log t$.

When there are multiple predictors, the PH assumption can be displayed in a way similar to Figures 18.2 and 18.4 but with the population additionally cross-classified by levels of the other predictors besides $X_1$. If there is one binary predictor $X_1$ and one continuous predictor $X_2$, the relationship in Figure 18.5 must hold at each time $t$ if linearity is assumed for $X_2$ and there is no interaction between $X_1$ and $X_2$. Methods for verifying the regression assumptions (e.g., splines and residuals) and the PH assumption are covered in detail under the Cox PH model in Chapter 20.

Fig. 18.3 PH model with one continuous predictor. Y-axis is $\log\lambda(t)$ or $\log\Lambda(t)$; for $\log\Lambda(t)$, drawn for $t_2 > t_1$. The slope of each line is $\beta_1$.

Fig. 18.4 PH model with one continuous predictor. Y-axis is $\log\lambda(t)$ or $\log\Lambda(t)$. For $\log\lambda$, the functions need not be monotonic.

The method for verifying the assumed shape of $S(t)$ in Section 18.1.3 is also useful when there are a limited number of categorical predictors. To validate a Weibull PH model one can stratify on $X$ and plot $\log\Lambda_{KM}(t|X\ \text{stratum})$ against $\log t$. This graph simultaneously assesses PH in addition to shape assumptions: all curves should be parallel as well as straight. Straight but nonparallel (non-PH) curves indicate that a series of Weibull models with differing $\gamma$ parameters will fit.

Fig. 18.5 Regression assumptions, linear additive PH or AFT model with two predictors. For PH, Y-axis is $\log\lambda(t)$ or $\log\Lambda(t)$ for a fixed $t$. For AFT, Y-axis is $\log(T)$.


18.3  Accelerated  Failure  Time  Models 


18.3.1  Model 

Besides modeling the effect of predictors by a multiplicative effect on the hazard function, other regression effects can be specified. The accelerated failure time (AFT) model is commonly used; it specifies that the predictors act multiplicatively on the failure time or additively on the log failure time. The effect of a predictor is to alter the rate at which a subject proceeds along the time axis (i.e., to accelerate the time to failure [331, pp. 33-35]). The model is

$$S(t|X) = \psi((\log(t) - X\beta)/\sigma), \tag{18.32}$$

where $\psi$ is any standardized survival distribution function. The parameter $\sigma$ is called the scale parameter. The model can also be stated as $(\log(T) - X\beta)/\sigma \sim \psi$ or $\log(T) = X\beta + \sigma\epsilon$, where $\epsilon$ is a random variable from the distribution $\psi$. Sometimes the untransformed $T$ is used in place of $\log(T)$. When the log form is used, the models are said to be log-normal, log-logistic, and so on.

The exponential and Weibull are the only two distributions that can describe either a PH or an AFT model.


18.3.2 Model Assumptions and Interpretation of Parameters

The $\log\lambda$ or $\log\Lambda$ transformation of the PH model has the following equivalent for AFT models.

$$\psi^{-1}[S(t|X)] = (\log(t) - X\beta)/\sigma. \tag{18.33}$$

Letting as before $\epsilon$ denote a random variable from the distribution $\psi$, the model is also

$$\log(T) = X\beta + \sigma\epsilon. \tag{18.34}$$

So the property of the response $T$ of interest for regression modeling is $\log(T)$. In the absence of censoring, we could check the model by plotting an $X$ against $\log T$ and checking that the residuals $\log(T) - X\hat\beta$ are distributed as $\psi$ to within a scale factor.

The assumptions of the AFT model are thus the following.

1. The true form of $\psi$ (the distributional family) is correctly specified.

2. In the absence of nonlinear and interaction terms, each $X_j$ affects $\log(T)$ or $\psi^{-1}[S(t|X)]$ linearly.

3. Implicit in these assumptions is that $\sigma$ is a constant independent of $X$.


A one-unit change in $X_j$ is then most simply understood as a $\beta_j$ change in the log of the failure time. The one-unit change in $X_j$ increases the failure time by a factor of $\exp(\beta_j)$.

The median survival time is obtained by solving $\psi((\log(t) - X\beta)/\sigma) = 0.5$, giving

$$T_{0.5}|X = \exp[X\beta + \sigma\psi^{-1}(0.5)]. \tag{18.35}$$

18.3.3  Specific  Models 


Common choices for the distribution function $\psi$ in Equation 18.32 are the extreme value distribution $\psi(u) = \exp(-\exp(u))$, the logistic distribution $\psi(u) = [1 + \exp(u)]^{-1}$, and the normal distribution $\psi(u) = 1 - \Phi(u)$. The AFT model equivalent of the Weibull model is obtained by using the extreme value distribution, negating $\beta$, and replacing $\gamma$ with $1/\sigma$ in Equation 18.24:

$$\begin{aligned}
S(t|X) &= \exp[-\exp((\log(t) - X\beta)/\sigma)] \\
T_{0.5}|X &= [\log(2)]^{\sigma}\exp(X\beta).
\end{aligned} \tag{18.36}$$

The exponential model is obtained by restricting $\sigma = 1$ in the extreme value distribution.

The log-normal regression model is

$$S(t|X) = 1 - \Phi((\log(t) - X\beta)/\sigma), \tag{18.37}$$

and the log-logistic model is

$$S(t|X) = [1 + \exp((\log(t) - X\beta)/\sigma)]^{-1}. \tag{18.38}$$

The $t$ distribution allows for more flexibility by varying the degrees of freedom. Figure 18.6 depicts possible hazard functions for the log $t$ distribution for varying $\sigma$ and degrees of freedom. However, this distribution does not have a late increasing hazard phase typical of human survival.

require(rms)
haz   <- survreg.auxinfo$t$hazard
times <- c(seq(0, .25, length=100), seq(.26, 2, length=150))
high  <- c(6, 1.5, 1.5, 1.75)
low   <- c(0, 0, 0, .25)
dfs   <- c(1, 2, 3, 5, 7, 15, 500)
cols  <- rep(1, 7)
ltys  <- 1:7
i <- 0
for(scale in c(.25, .6, 1, 2)) {
  i <- i + 1
  plot(0, 0, xlim=c(0,2), ylim=c(low[i], high[i]),
       xlab=expression(t), ylab=expression(lambda(t)), type="n")
  col <- 1.09
  j <- 0
  for(df in dfs) {
    j <- j + 1
    ## Divide by t to get hazard for log t distribution
    lines(times,
          haz(log(times), 0, c(log(scale), df))/times,
          col=cols[j], lty=ltys[j])
    if(i==1) text(1.7, .23 + haz(log(1.7), 0,
                  c(log(scale), df))/1.7, format(df))
  }
  title(paste("Scale:", format(scale)))
}   # Figure 18.6


All three of these parametric survival models have median survival time $T_{0.5}|X = \exp(X\beta)$.
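The median formula in Equation 18.35 can be verified directly for the extreme value, logistic, and normal choices of $\psi$. In the R sketch below, the values of `lp` (standing for $X\beta$) and `sigma` are arbitrary illustrative numbers, and the list `psi` is our own construction. For the logistic and normal cases $\psi^{-1}(0.5) = 0$, so the median is $\exp(X\beta)$; for the extreme value case $\psi^{-1}(0.5) = \log(\log 2)$, which reproduces the Weibull median in Equation 18.36.

```r
# Standardized survival functions psi(u) for three AFT families
psi <- list(
  extreme  = function(u) exp(-exp(u)),       # Weibull
  logistic = function(u) 1 / (1 + exp(u)),   # log-logistic
  normal   = function(u) 1 - pnorm(u)        # log-normal
)
# psi^{-1}(0.5) for each family
psi_inv_half <- c(extreme = log(log(2)), logistic = 0, normal = 0)

# Median from Equation 18.35; lp (X beta) and sigma are illustrative values
lp <- 5; sigma <- 0.4
med <- exp(lp + sigma * psi_inv_half)

# Survival evaluated at the median should equal 0.5 under each model
s_at_med <- sapply(names(psi),
                   function(k) psi[[k]]((log(med[[k]]) - lp) / sigma))
print(rbind(median = med, S_at_median = s_at_med))
```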


18.3.4 Estimation

Maximum likelihood estimation is used much the same as in Section 18.2.5. Care must be taken in the choice of initial values; iterative methods are especially prone to problems in choosing the initial $\sigma$. Estimation works better if $\sigma$ is parameterized as $\exp(\delta)$. Once $\beta$ and $\sigma$ ($\exp(\delta)$) are estimated, MLEs of secondary parameters such as survival probabilities and medians can readily be obtained:

$$\begin{aligned}
\hat S(t|X) &= \psi((\log(t) - X\hat\beta)/\hat\sigma) \\
\hat T_{0.5}|X &= \exp[X\hat\beta + \hat\sigma\psi^{-1}(0.5)].
\end{aligned} \tag{18.39}$$

Fig. 18.6 log(T) distribution for $\sigma = 0.25, 0.6, 1, 2$ and for degrees of freedom 1, 2, 3, 5, 7, 15, 500 (almost log-normal). The top left plot has degrees of freedom written in the plot. [Panel titles: Scale: 0.25, Scale: 0.6, Scale: 1, Scale: 2]


For normal and logistic distributions, $\hat T_{0.5}|X = \exp(X\hat\beta)$. The MLE of the effect on $\log(T)$ of increasing $X_j$ by $d$ units is $\hat\beta_j d$ if $X_j$ is linear and additive.

The delta (statistical differential) method can be used to compute an estimate of the variance of $f = [\log(t) - X\hat\beta]/\hat\sigma$. Let $(\hat\beta, \hat\delta)$ denote the estimated parameters, and let $\hat V$ denote the estimated covariance matrix for these parameter estimates. Let $F$ denote the vector of derivatives of $f$ with respect to $(\beta_0, \beta_1, \ldots, \beta_p, \delta)$; that is, $F = [-1, -X_1, -X_2, \ldots, -X_p, -(\log(t) - X\hat\beta)]/\hat\sigma$. The variance of $f$ is then approximately

$$\text{Var}(f) = F\hat VF'. \tag{18.40}$$

Letting $s$ be the square root of the variance estimate and $z_{1-\alpha/2}$ be the normal critical value, a $1 - \alpha$ confidence limit for $S(t|X)$ is

$$\psi((\log(t) - X\hat\beta)/\hat\sigma \pm z_{1-\alpha/2} \times s). \tag{18.41}$$
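The delta-method calculation in Equations 18.40 and 18.41 amounts to a few lines of matrix arithmetic. The R sketch below uses entirely hypothetical estimates and a hypothetical covariance matrix `V` for a log-logistic model with an intercept and one predictor; none of these numbers come from a fitted model. In practice the covariance matrix would be taken from the fit object.

```r
# Hypothetical estimates for a log-logistic AFT model (illustrative values only)
beta  <- c(5.4, 0.10)               # (beta0, beta1)
delta <- log(0.12)                  # log(sigma)
V <- diag(c(0.002, 0.004, 0.02))    # hypothetical covariance of (beta0, beta1, delta)
V[1, 2] <- V[2, 1] <- -0.002

x <- 1; t0 <- 250                   # covariate value and time of interest
sigma <- exp(delta)
lp    <- beta[1] + beta[2] * x
f     <- (log(t0) - lp) / sigma

# Derivatives of f with respect to (beta0, beta1, delta)
Fvec <- c(-1, -x, -(log(t0) - lp)) / sigma
s    <- sqrt(drop(Fvec %*% V %*% Fvec))   # square root of Equation 18.40

psi <- function(u) 1 / (1 + exp(u))       # logistic survival function
z   <- qnorm(0.975)
est <- c(S = psi(f), lower = psi(f + z * s), upper = psi(f - z * s))
print(round(est, 3))
```

Because $\psi$ is decreasing, adding $z \times s$ to $f$ gives the lower limit for survival and subtracting it gives the upper limit.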


18.3.5  Residuals 

For an AFT model, standardized residuals are simply

$$r = (\log(T) - X\hat\beta)/\hat\sigma. \tag{18.42}$$

When $T$ is right-censored, $r$ is right-censored. Censoring must be taken into account, for example, by displaying Kaplan-Meier estimates based on groups of residuals rather than showing individual residuals. The residuals can be used to check for lack of fit as described in the next section. Note that examining individual uncensored residuals is not appropriate, as their distribution is conditional on $T_i < C_i$, where $C_i$ is the censoring time.

Cox  and  Snell  4  proposed  a  type  of  general  residuals  that  also  work  for 
censored data. Using their method on the cumulative probability scale results in the probability integral transformation. If the probability of failure before time $t$ given $X$ is $F(t|X) = 1 - S(t|X)$, then $F(T|X) = 1 - S(T|X)$ has a uniform [0, 1] distribution, where $T$ is a subject's actual failure time. When $T$ is right-censored, so is $1 - S(T|X)$. Substituting $\hat S$ for $S$ results in an approximate uniform [0, 1] distribution for any value of $X$. One minus the Kaplan-Meier estimate of $1 - \hat S(T|X)$ (using combined data for all $X$) is compared against a 45° line to check for goodness of fit. A more stringent assessment is obtained by repeating this process while stratifying on $X$.


18.3.6  Assessment  of  Model  Fit 

For  a  single  binary  predictor,  all  assumptions  of  the  AFT  model  are  depicted 
in  Figure  18.7.  That  figure  also  shows  the  assumptions  for  any  two  values  of 
a  single  continuous  predictor  that  behaves  linearly.  For  a  single  continuous 
predictor,  the  relationships  in  Figure  18.8  must  hold  for  any  two  follow-up 
times.  The  regression  assumptions  are  isolated  in  Figure  18.5. 

To verify the fit of a log-logistic model with age as the only predictor, one could stratify by quartiles of age and check for linearity and parallelism of the four logit $S_\Lambda(t)$ or $S_{KM}(t)$ curves over increasing $t$ as in Figure 18.7, which stresses the distributional assumption (no $T$ by $X$ interaction and linearity vs. $\log(t)$). To stress the linear regression assumption while checking for absence of time interactions (part of the distributional assumptions), one could make a plot like Figure 18.8. For each decile of age, the logit transformation of the 1-, 3-, and 5-year survival estimates for that decile would be plotted against the mean age in the decile. This checks for linearity and constancy of the age effect over time. Regression splines will be a more effective method for checking linearity and determining transformations. This is demonstrated in Chapter 20 with the Cox model, but identical methods apply here.

Fig. 18.7 AFT model with one predictor. Y-axis is $\psi^{-1}[S(t|X)] = (\log(t) - X\beta)/\sigma$. Drawn for $d > c$. The slope of the lines is $\sigma^{-1}$.

Fig. 18.8 AFT model with one continuous predictor. Y-axis is $\psi^{-1}[S(t|X)] = (\log(t) - X\beta)/\sigma$. Drawn for $t_2 > t_1$. The slope of each line is $\beta_1/\sigma$ and the difference between the lines is $\log(t_2/t_1)/\sigma$.

As an example, consider data from Kalbfleisch and Prentice [331, pp. 1-2], who present data from Pike [508] on the time from exposure to the carcinogen DMBA to mortality from vaginal cancer in rats. The rats are divided into two groups on the basis of a pre-treatment regime. Survival times in days (with censored times marked +) are found in Table 18.2.

Table 18.2 Rat vaginal cancer data from Pike [508]

Group 1   143 164 188 188 190 192 206 209 213 216
          220 227 230 234 246 265 304 216+ 244+

Group 2   142 156 163 198 205 232 232 233 233 233
          233 239 240 261 280 280 296 296 323 204+
          344+


getHdata(kprats)
kprats$group <- factor(kprats$group, 0:1, c('Group 1', 'Group 2'))
dd <- datadist(kprats); options(datadist="dd")
S <- with(kprats, Surv(t, death))
f <- npsurv(S ~ group, type="fleming", data=kprats)
survplot(f, n.risk=TRUE, conf='none',   # Figure 18.9
         label.curves=list(keys='lines'), levels.only=TRUE)
title(sub="Nonparametric estimates", adj=0, cex=.7)
# Check fits of Weibull, log-logistic, log-normal
xl <- c(4.8, 5.9)
survplot(f, loglog=TRUE, logt=TRUE, conf="none", xlim=xl,
         label.curves=list(keys='lines'), levels.only=TRUE)
title(sub="Weibull (extreme value)", adj=0, cex=.7)
survplot(f, fun=function(y) log(y/(1-y)), ylab="logit S(t)",
         logt=TRUE, conf="none", xlim=xl,
         label.curves=list(keys='lines'), levels.only=TRUE)
title(sub="Log-logistic", adj=0, cex=.7)
survplot(f, fun=qnorm, ylab="Inverse Normal S(t)",
         logt=TRUE, conf="none",
         xlim=xl, cex.label=.7,
         label.curves=list(keys='lines'), levels.only=TRUE)
title(sub="Log-normal", adj=0, cex=.7)


The top left plot in Figure 18.9 displays nonparametric survival estimates for the two groups, with the number of rats "at risk" at each 30-day mark written above the x-axis. The remaining three plots are for checking assumptions of three models. None of the parametric models presented will completely allow for such a long period with no deaths. Neither will any allow for the early crossing of survival curves. Log-normal and log-logistic models yield very similar results due to the similarity in shapes between $\Phi(z)$ and $[1 + \exp(-z)]^{-1}$ for non-extreme $z$. All three transformations show good parallelism after the early crossing. The log-logistic and log-normal transformations are slightly more linear. The fitted models are:


fw <- psm(S ~ group, data=kprats, dist='weibull')
fl <- psm(S ~ group, data=kprats, dist='loglogistic', y=TRUE)
fn <- psm(S ~ group, data=kprats, dist='lognormal')
latex(fw, fi='')

$$\text{Prob}\{T > t\} = \exp\left[-\exp\left(\frac{\log(t) - X\beta}{0.1832976}\right)\right] \quad \text{where}$$

$$X\hat\beta = 5.450859 + 0.131983[\text{Group 2}]$$

and $[c] = 1$ if subject is in group $c$, 0 otherwise.

Fig. 18.9 Altschuler-Nelson-Fleming-Harrington nonparametric survival estimates for rats treated with DMBA [508], along with various transformations of the estimates for checking distributional assumptions of three parametric survival models. [Panel titles include: Nonparametric estimates, Log-logistic, Log-normal; x-axis: log Survival Time in s]


latex(fl, fi='')

$$\text{Prob}\{T > t\} = \left[1 + \exp\left(\frac{\log(t) - X\beta}{0.1159753}\right)\right]^{-1} \quad \text{where}$$

$$X\hat\beta = 5.375675 + 0.1051005[\text{Group 2}]$$

and $[c] = 1$ if subject is in group $c$, 0 otherwise.

latex(fn, fi='')

$$\text{Prob}\{T > t\} = 1 - \Phi\left(\frac{\log(t) - X\beta}{0.2100184}\right) \quad \text{where}$$

$$X\hat\beta = 5.375328 + 0.0930606[\text{Group 2}]$$

and $[c] = 1$ if subject is in group $c$, 0 otherwise.

Table 18.3 Group effects from three survival models

Model                      Group 2:1             Median Survival Time
                           Failure Time Ratio    Group 1    Group 2
Extreme Value (Weibull)    1.14                  217        248
Log-logistic               1.11                  217        241
Log-normal                 1.10                  217        238

The estimated failure time ratios and median failure times for the two groups are given in Table 18.3. For example, the effect of going from Group 1 to Group 2 is to increase log failure time by 0.132 for the extreme value model, giving a Group 2:1 failure time ratio of $\exp(0.132) = 1.14$. This ratio is also the ratio of median survival times. We choose the log-logistic model for its simpler form. The fitted survival curves are plotted with the nonparametric estimates in Figure 18.10. Excellent agreement is seen, except for 150 to 180 days for Group 2. The standard error of the regression coefficient for group in the log-logistic model is 0.0636, giving a Wald $\chi^2$ for group differences of $(.105/.0636)^2 = 2.73$, $P = 0.1$.
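The Wald statistic quoted above can be reproduced from the reported coefficient and standard error. The short R sketch below also adds a 0.95 confidence interval for the failure time ratio, computed as $\exp(\hat\beta \pm zs)$; the interval itself is our own illustration and is not reported in the text.

```r
b  <- 0.1051005   # log-logistic coefficient for Group 2 (fitted model above)
se <- 0.0636      # its standard error, as quoted in the text

wald  <- (b / se)^2                   # Wald chi-square with 1 d.f.
p     <- 1 - pchisq(wald, df = 1)
ratio <- exp(b)                       # Group 2:1 failure time ratio
ci    <- exp(b + c(-1, 1) * qnorm(0.975) * se)

print(round(c(chisq = wald, P = p, ratio = ratio,
              lower = ci[1], upper = ci[2]), 3))
```

The interval includes 1, in line with the modest Wald statistic.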

survplot(f, conf.int=FALSE,   # Figure 18.10
         levels.only=TRUE, label.curves=list(keys='lines'))
survplot(fl, add=TRUE, label.curves=FALSE, conf.int=FALSE)

The Weibull PH form of the fitted extreme value model, using Equation 18.24, is

$$\text{Prob}\{T > t\} = \exp\{-t^{5.456}\exp(X\hat\beta)\} \quad \text{where}$$

$$X\hat\beta = -29.74 - 0.72[\text{Group 2}]$$

and $[c] = 1$ if subject is in group $c$, 0 otherwise.

Fig. 18.10 Agreement between fitted log-logistic model and nonparametric survival estimates for rat vaginal cancer data.
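The PH coefficients shown above follow from the AFT estimates by matching Equation 18.24 with Equation 18.36, which gives $\gamma = 1/\sigma$ and $\beta_{PH} = -\beta_{AFT}/\sigma$. The R sketch below performs this conversion by hand using the printed estimates (the `pphsm` function does this for a fitted `psm` object); the evaluation time of 250 days is an arbitrary choice for checking that the two forms agree.

```r
# Extreme value (Weibull) AFT estimates from the fitted model above
sigma    <- 0.1832976
beta_aft <- c(Intercept = 5.450859, Group2 = 0.131983)

# Matching Equations 18.24 and 18.36: gamma = 1/sigma, beta_PH = -beta_AFT/sigma
gam     <- 1 / sigma
beta_ph <- -beta_aft / sigma
print(round(c(gamma = gam, beta_ph), 3))

# Both forms give the same survival probability, e.g., Group 2 at 250 days
t0    <- 250
s_aft <- exp(-exp((log(t0) - sum(beta_aft)) / sigma))
s_ph  <- exp(-t0^gam * exp(sum(beta_ph)))
print(c(AFT = s_aft, PH = s_ph,
        hazard_ratio = unname(exp(beta_ph["Group2"]))))
```

The Group 2:1 hazard ratio implied by the PH form is $\exp(-0.72)$, roughly 0.49, even though the failure time ratio is only 1.14, because the hazard rises steeply with time when $\gamma$ is large.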

A sensitive graphical verification of the distributional assumptions of the AFT model is obtained by plotting the estimated survival distribution of standardized residuals (Equation 18.42 in Section 18.3.5), censored identically to the way $T$ is censored. This distribution is plotted along with the theoretical distribution $\psi$. The assessment may be made more stringent by stratifying the residuals by important subject characteristics and plotting separate survival function estimates; they should all have the same standardized distribution (e.g., same $\sigma$).


r <- resid(fl, 'cens')
survplot(npsurv(r ~ group, data=kprats),
         conf='none', xlab='Residual',
         label.curves=list(keys='lines'), levels.only=TRUE)
survplot(npsurv(r ~ 1), conf='none', add=TRUE, col='red')
lines(r, lwd=1, col='blue')   # Figure 18.11


As an example, Figure 18.11 shows the Kaplan-Meier estimate of the distribution of residuals, Kaplan-Meier estimates stratified by group, and the assumed log-logistic distribution.

Fig. 18.11 Kaplan-Meier estimates of distribution of standardized censored residuals from the log-logistic model, along with the assumed standard log-logistic distribution (dashed curve). The step function in red is the estimated distribution of all residuals, and the step functions in black are the estimated distributions of residuals stratified by group, as indicated. The blue curve is the assumed log-logistic distribution.


Section  19.2  has  a  more  in-depth  example  of  this  approach. 


18.3.7  Validating  the  Fitted  Model 

AFT models may be validated for both calibration and discrimination accuracy using the same methods that are presented for the Cox model in Section 20.11. The methods discussed there for checking calibration are based on choosing a single follow-up time. Checking the distributional assumptions of the parametric model is also a check of calibration accuracy in a sense. Another indirect calibration assessment may be obtained from a set of Cox-Snell residuals (Section 18.3.5) or by using ordinary residuals as just described. A higher resolution indirect calibration assessment based on plotting individual uncensored failure times is available when the theoretical censoring times for those observations are known. Let $C$ denote a subject's censoring time and $F$ the cumulative distribution of a failure time $T$. The expected value of $F(T|X)$ is 0.5 when $T$ is an actual failure time random variable. The expected value for an event time that is observed because it is uncensored is the expected value of $F(T|T < C, X) = 0.5F(C|X)$. A smooth plot (using, say, loess) of $F(T|X) - 0.5F(C|X)$ against $X\hat\beta$ should be a flat line through $y = 0$ if the model is well calibrated. A smooth plot of $2F(T|X)/F(C|X)$ against $X\hat\beta$ (or anything else) should be a flat line through $y = 1$. This method assumes that the model is calibrated well enough that we can substitute $1 - \hat S(C|X)$ for $F(C|X)$.
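The identity $E[F(T) \mid T < C] = 0.5F(C)$ used above holds because $F(T)$ is uniform on $[0, 1]$, so conditioning on $T < C$ leaves it uniform on $[0, F(C)]$. The simulation sketch below checks this for an exponential failure time with a single known censoring time; the rate, the censoring time, and all object names are arbitrary choices for illustration.

```r
set.seed(2)
n     <- 200000
rate  <- 0.2                          # illustrative exponential failure rate
Fdist <- function(t) 1 - exp(-rate * t)

tt  <- rexp(n, rate)                  # simulated failure times
cc  <- 4                              # a common, known censoring time
obs <- tt < cc                        # observations that are uncensored

mean_F_obs <- mean(Fdist(tt[obs]))    # average of F(T) among observed failures
mean_ratio <- mean(2 * Fdist(tt[obs]) / Fdist(cc))

print(c(mean_F_given_uncensored = mean_F_obs,
        half_F_C   = 0.5 * Fdist(cc),
        mean_ratio = mean_ratio))
```

With a correctly specified model the first two quantities agree and the mean ratio is close to 1, which is what the flat-line plots described above are intended to show across values of $X\hat\beta$.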


18.4 Buckley-James Regression Model

Buckley and James [81] developed a method for estimating regression coefficients using least squares after imputing censored residuals. Their method does not assume a distribution for survival time or the residuals, but is aimed at estimating expected survival time or expected log survival time given predictor variables. This method has been generalized to allow for smooth nonlinear effects and interactions in the S bj function in the rms package, written by Stare and Harrell [585].


18.5  Design  Formulations 

Various  designs  can  be  formulated  with  survival  regression  models  just  as 
with  other  regression  models.  By  constructing  the  proper  dummy  variables, 
ANOVA  and  ANOCOVA  models  can  easily  be  specified  for  testing  differences 
in survival time between multiple treatments. Interactions and complex nonlinear effects may also be modeled.


18.6  Test  Statistics 

As  discussed  previously,  likelihood  ratio,  score,  and  Wald  statistics  can  be 
derived  from  the  maximum  likelihood  analysis,  and  the  choice  of  test  statistic 
depends  on  the  circumstance  and  on  computational  convenience. 


18.7  Quantifying  Predictive  Ability 

See  Section  20.10  for  a  generalized  measure  of  concordance  between  predicted 
and  observed  survival  time  (or  probability  of  survival)  for  right-censored  data. 


18.8  Time-Dependent  Covariates 

Time-dependent covariates (predictors) require special likelihood functions and add significant complexity to analyses in exchange for greater versatility and enhanced predictive discrimination 4 . Nicolaie et al.47^ and D'Agostino et al. 45 provide useful static covariate approaches to modeling time-dependent predictors using landmark analysis.


18.9 R Functions

Therneau’s  survreg  function  (part  of  his  survival  package)  can  fit  regression 
models  in  the  AFT  family  with  left-,  right-,  or  interval-censoring.  The  time 
variable  can  be  untransformed  or  log-transformed  (the  default).  Distributions 
supported  are  extreme  value  (Weibull  and  exponential),  normal,  logistic,  and 
Student-t.  The  version  of  survreg  in  rms  that  fits  parametric  survival  models 
in  the  same  framework  as  lrm,  ols,  and  cph  is  called  psm.  psm  works  with 
print,  coef,  formula,  specs,  summary,  anova,  predict,  Predict,  fastbw,  latex, 
nomogram,  validate,  calibrate,  survest,  and  survplot  functions  for  obtaining 
and  plotting  predicted  survival  probabilities.  The  dist  argument  to  psm  can  be 
"exponential",  "extreme",  "gaussian",  "logistic",  "loglogistic" ,  "lognormal", 
"t",  or  "weibull".  To  fit  a  model  with  no  covariables,  use  the  command 

psm(Surv(d.time, event) ~ 1)

To  restate  a  Weibull  or  exponential  model  in  PH  form,  use  the  pphsm  function. 
An  example  of  how  many  of  the  functions  are  used  is  found  below. 

units(d.time) <- "Year"
f <- psm(Surv(d.time, cdeath) ~ lsp(age, 65) * sex)
# default is Weibull
anova(f)
summary(f)            # summarize effects with delta log T
latex(f)              # typeset math. form of fitted model
survest(f, times=1)   # 1y survival est. for all subjects
survest(f, expand.grid(sex="female", age=30:80), times=1:2)
# 1y, 2y survival estimates vs. age, for females
survest(f, data.frame(sex="female", age=50))
# survival curve for an individual subject
survplot(f, sex=NA, age=50, n.risk=T)
# survival curves for each sex, adjusting age to 50
f.ph <- pphsm(f)      # convert from AFT to PH
summary(f.ph)         # summarize with hazard ratios
                      # instead of changes in log(T)
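
The AFT-to-PH restatement rests on a standard identity for the Weibull model: if log T = Xβ + σW with W following the extreme value distribution, the cumulative hazard is exp((log t - Xβ)/σ), so a one-unit change in a predictor multiplies the hazard by exp(-β/σ) at every t. The base R sketch below checks this numerically. The values of b0, beta, and sigma are made up, and no rms function is called.

```r
# Illustrative (made-up) Weibull AFT parameters: log(T) = b0 + beta*x + sigma*W
b0 <- 2; beta <- 0.5; sigma <- 0.8

# Cumulative hazard implied by the AFT form
cumhaz <- function(t, x) exp((log(t) - (b0 + beta * x)) / sigma)

t  <- c(0.5, 1, 5, 20)
hr <- cumhaz(t, x = 1) / cumhaz(t, x = 0)   # ratio for x = 1 vs. x = 0
print(hr)                                   # the same value at every t
print(exp(-beta / sigma))                   # PH log hazard ratio is -beta/sigma
```

Because the ratio does not depend on t, the same fit can be read either as a time ratio (AFT) or as a hazard ratio (PH).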

Special functions work with objects created by psm to create S functions that
contain the analytic form for predicted survival probabilities (Survival),
hazard functions (Hazard), quantiles of survival time (Quantile), and mean or
expected survival time (Mean). Once the S functions are constructed, they can
be used in a variety of contexts. The survplot and survest functions have
a special argument for psm fits: what. The default is what="survival" to
estimate or plot survival probabilities. Specifying what="hazard" will plot
hazard functions. Predict also has a special argument for psm fits: time.
Specifying a single value for time results in survival probability for that
time being plotted instead of Xβ̂. Examples of many of the functions appear
below, with the output of the survplot command shown in Figure 18.12.

med   <- Quantile(fl)
meant <- Mean(fl)
haz   <- Hazard(fl)
surv  <- Survival(fl)
latex(surv, file='', type='Sinput')

surv <- function(times = NULL, lp = NULL,
                 parms = -2.15437773933124)
{
  1/(1 + exp((logb(times) - lp)/exp(parms)))
}

# Plot estimated hazard functions and add median
# survival times to graph
survplot(fl, group, what="hazard")    # Figure 18.12
# Compute median survival time
m <- med(lp=predict(fl,
         data.frame(group=levels(kprats$group))))
m

       1        2
216.0857 240.0328

med(lp=range(fl$linear.predictors))

[1] 216.0857 240.0328

m <- format(m, digits=3)
text(68, .02, paste("Group 1 median: ", m[1], "\n",
                    "Group 2 median: ", m[2], sep=""))
# Compute survival probability at 210 days
xbeta <- predict(fl,
                 data.frame(group=c("Group 1","Group 2")))
surv(210, xbeta)

        1         2
0.5612718 0.7599776
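
The printed survival function can be checked by hand. For the log-logistic model the median survival time is exp(lp), so the linear predictors can be recovered from the two medians printed above, and plugging them into the analytic survival function reproduces the 210-day survival probabilities to within rounding. The sketch uses only base R; surv_ll is a hypothetical name for a copy of the printed function, with log used in place of logb.

```r
# Copy of the analytic log-logistic survival function printed above
surv_ll <- function(times, lp, parms = -2.15437773933124) {
  1 / (1 + exp((log(times) - lp) / exp(parms)))
}

med_times <- c(216.0857, 240.0328)  # median survival times printed above
lp        <- log(med_times)         # log-logistic median is exp(lp)

p_med <- surv_ll(med_times, lp)     # survival at the median is 0.5 by construction
p_210 <- surv_ll(210, lp)           # survival probability at 210 days
print(round(p_210, 4))
```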

The S object called survreg.distributions in Therneau's survival package
and the object survreg.auxinfo in the rms package have detailed information
for extreme-value, logistic, normal, and t distributions. For each distribution,
components include the deviance function, an algorithm for obtaining starting
parameter estimates, a LaTeX representation of the survival function, and S
functions defining the survival, hazard, quantile functions, and basic survival
inverse function (which could have been used in Figure 18.9). See Figure 18.6
for examples. rms's val.surv function is useful for indirect external
validation of parametric models using Cox-Snell residuals and other approaches
of Section 18.3.7. The plot method for an object created by val.surv makes it
easy to stratify all computations by a variable of interest to more stringently
validate the fit with respect to that variable.

rms's bj function fits the Buckley-James model for right-censored
responses.

Fig. 18.12 Estimated hazard functions for log-logistic fit to rat vaginal cancer data,
along with median survival times.


Kooperberg et al.'s adaptive linear spline log-hazard model 00,361,59 has
been implemented in the S function hare. Their procedure searches for
second-order interactions involving predictors (and linear splines of them)
and linear splines in follow-up time (allowing for non-proportional hazards).
hare is also used to estimate calibration curves for parametric survival
models (rms function calibrate) as it is for Cox models.


18.10  Further  Reading 


1  Wellek65' developed a test statistic for a specified maximum survival
   difference after relating this difference to a hazard ratio.

2  Hougaard308 compared accelerated failure time models with proportional
   hazard models.

3  Gore et al.226 discuss how an AFT model (the log-logistic model) gives rise
   to varying hazard ratios.

4  See Hillis293 for other types of residuals and plots that use them.

5  See Gore et al.226 and Lawless382 for other methods of checking assumptions
   for AFT models. Lawless is an excellent text for in-depth discussion of
   parametric survival modeling. Kwong and Hutton369 present other methods of
   choosing parametric survival models, and discuss the robustness of estimates
   when fitting an incorrectly chosen accelerated failure time model.


18.11  Problems 

1.  For  the  failure  times  (in  days) 

1  3  3+  6+  7+ 


compute MLEs of the following parameters of an exponential distribution
by hand: λ, μ, T0.5, and S(3 days). Compute 0.95 confidence limits for λ
and S(3), basing the latter on log[Λ(t)].

2. For the same data in Problem 1, compute MLEs of parameters of a Weibull
distribution. Also compute the MLEs of S(3) and T0.5.


Chapter  19 

Case  Study  in  Parametric  Survival 
Modeling  and  Model  Approximation 


Consider the random sample of 1000 patients from the SUPPORT study,352
described in Section 3.12. In this case study we develop a parametric
survival time model (accelerated failure time model) for time until death for
the acute disease subset of SUPPORT (acute respiratory failure, multiple organ
system failure, coma). We eliminate the chronic disease categories because
the shapes of the survival curves are different between acute and chronic
disease categories. To fit both acute and chronic disease classes would require
a log-normal model with a σ parameter that is disease-specific.

Patients  had  to  survive  until  day  3  of  the  study  to  qualify.  The  baseline 
physiologic  variables  were  measured  during  day  3. 


19.1  Descriptive  Statistics 

First  we  create  a  variable  acute  to  flag  the  categories  of  interest,  and  print 
univariable  descriptive  statistics  for  the  data  subset. 

require(rms)
getHdata(support)    # Get data frame from web site
acute <- support$dzclass %in% c('ARF/MOSF','Coma')
latex(describe(support[acute,]), file='')


©  Springer  International  Publishing  Switzerland  2015 

F.E.  Harrell,  Jr.,  Regression  Modeling  Strategies ,  Springer  Series 

in  Statistics,  DOI  10.1007/978-3-319-19425-7_19 
support[acute, ]

35 Variables  537 Observations


age : Age

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  529  1  60.7  28.49  35.22  47.93  63.67  74.49  81.54  85.56 


lowest  :  18.04  18.41  19.76  20.30  20.31 
highest:  91.62  91.82  91.93  92.74  95.51 


death  :  Death  at  any  time  up  to  NDI  date:31DEC94 

n  missing  unique  Info  Sum  Mean 

537  0  2  0.67  356  0.6629 

sex 

n  missing  unique 
537  0  2 

female (251, 47%), male (286, 53%)


hospdead  :  Death  in  Hospital 

n  missing  unique  Info  Sum  Mean 

537  0  2  0.7  201  0.3743 


slos : Days from Study Entry to Discharge

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  85  1  23.44  4.0  5.0  9.0  15.0  27.0  47.4  68.2 

lowest  :  3  4  5  6  7,  highest:  145  164  202  236  241 


d.time : Days of Follow-Up

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  340  1  446.1  4  6  16  182  724  1421  1742 

lowest  :  3  4  5  6  7,  highest:  1977  1979  1982  2011  2022 


dzgroup 

n  missing  unique 

537  0  3 

ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)


dzclass 

n  missing  unique 
537  0  2 

ARF/MOSF (477, 89%), Coma (60, 11%)


num.co  :  number  of  comorbidities 

n  missing  unique  Info  Mean 

537  0  7  0.93  1.525 

0  1  2  3  4  5  6 

Frequency  111  196  133  51  31  10  5 

%  21  36  25  9  6  2  1 

edu : Years of Education

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

411  126  22  0.96  12.03  7  8  10  12  14  16  17 

lowest :  0  1  2  3  4,  highest:  17  18  19  20  22


income 

n  missing  unique 
335  202  4 

under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%)
>$50k (35, 10%)


scoma  :  SUPPORT  Coma  Score  based  on  Glasgow  D3 

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  11  0.82  19.24  0  0  0  0  37  55  100 

            0   9  26  37  41  44  55  61  89  94 100
Frequency 301  50  44  19  17  43  11   6   8   6  32
%          56   9   8   4   3   8   2   1   1   1   6


charges : Hospital Charges

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

517  20  516  1  86652  11075  15180  27389  51079  100904  205562  283411 

lowest  :  3448  4432  4574  5555  5849 
highest:  504660  538323  543761  706577  740010 


totcst : Total RCC cost

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

471  66  471  1  46360  6359  8449  15412  29308  57028  108927  141569 

lowest  :  0  2071  2522  3191  3325 

highest:  269057  269131  338955  357919  390460 


totmcst  :  Total  micro-cost 

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95

331  206  328  1  39022  6131  8283  14415  26323  54102  87495  111920


lowest  :  0  1562  2478  2626  3421 

highest:  144234  154709  198047  234876  271467 


avtisst : Average TISS, Days 3-25

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

536  1  205  1  29.83  12.46  14.50  19.62  28.00  39.00  47.17  50.37 

lowest  :  4.000  5.667  8.000  9.000  9.500 
highest:  58.500  59.000  60.000  61.000  64.000 


race 

n  missing  unique 
535  2  5 

          white black asian other hispanic
Frequency   417    84     4     8       22
%            78    16     1     1        4

meanbp : Mean Arterial Blood Pressure Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  109  1  83.28  41.8  49.0  59.0  73.0  111.0  124.4  135.0 

lowest  :  0  20  27  30  32,  highest:  155  158  161  162  180 


wblc : White Blood Cell Count Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

532  5  241  1  14.1  0.8999  4.5000  7.9749  12.3984  18.1992  25.1891  30.1873 

lowest  :  0.05000  0.06999  0.09999  0.14999  0.19998 
highest:  51.39844  58.19531  61.19531  79.39062  100.00000 


hrt : Heart Rate Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  111  1  105  51  60  75  111  126  140  155 

lowest  :  0  11  30  36  40,  highest:  189  193  199  232  300 

resp : Respiration Rate Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95

537  0  45  1  23.72  8  10  12  24  32  39  40

lowest :  0  4  6  7  8,  highest:  48  49  52  60  64

temp : Temperature (celcius) Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  61  1  37.52  35.50  35.80  36.40  37.80  38.50  39.09  39.50 

lowest  :  32.50  34.00  34.09  34.90  35.00 
highest:  40.20  40.59  40.90  41.00  41.20 

pafi : PaO2/(.01*FiO2) Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

500  37  357  1  227.2  86.99  105.08  137.88  202.56  290.00  390.49  433.31 

lowest  :  45.00  48.00  53.33  54.00  55.00 
highest:  574.00  595.12  640.00  680.00  869.38 

alb : Serum Albumin Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

346  191  34  1  2.668  1.700  1.900  2.225  2.600  3.100  3.400  3.800 

lowest  :  1.100  1.200  1.300  1.400  1.500 
highest:  4.100  4.199  4.500  4.699  4.800 

bili : Bilirubin Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

386  151  88  1  2.678  0.3000  0.4000  0.6000  0.8999  2.0000  6.5996  13.1743 

lowest  :  0.09999  0.19998  0.29999  0.39996  0.50000 

highest:  22.59766  30.00000  31.50000  35.00000  39.29688 

crea : Serum creatinine Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  84  1  2.232  0.6000  0.7000  0.8999  1.3999  2.5996  5.2395  7.3197 

lowest  :  0.3  0.4  0.5  0.6  0.7,  highest:  10.4  10.6  11.2  11.6  11.8 


sod : Serum sodium Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  38  1  138.1  129  131  134  137  142  147  150 

lowest  :  118  120  121  126  127,  highest:  156  157  158  168  175 

ph : Serum pH (arterial) Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

500  37  49  1  7.416  7.270  7.319  7.380  7.420  7.470  7.510  7.529 

lowest  :  6.960  6.989  7.069  7.119  7.130 
highest:  7.560  7.569  7.590  7.600  7.659 

glucose : Glucose Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

297  240  179  1  167.7  76.0  89.0  106.0  141.0  200.0  292.4  347.2 

lowest  :  30  42  52  55  68,  highest:  446  468  492  576  598 

bun : BUN Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95

304  233  100  1  38.91  8.00  11.00  16.75  30.00  56.00  79.70  100.70

lowest :  1  3  4  5  6,  highest:  123  124  125  128  146

urine : Urine Output Day 3

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

303  234  262  1  2095  20.3  364.0  1156.5  1870.0  2795.0  4008.6  4817.5 

lowest  :  0  5  8  15  20,  highest:  6865  6920  7360  7560  7750 


adlp  :  ADL  Patient  Day  3 

n  missing  unique  Info  Mean 

104  433  8  0.87  1.577 

           0  1  2  3  4  5  6  7
Frequency 51 19  7  6  4  7  8  2
%         49 18  7  6  4  7  8  2


adls  :  ADL  Surrogate  Day  3 

n  missing  unique  Info  Mean 

392  145  8  0.89  1.86 

            0   1   2   3   4   5   6   7
Frequency 185  68  22  18  17  20  39  23
%          47  17   6   5   4   5  10   6


sfdm2 

n  missing  unique 
468  69  5 

no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%)
SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)

adlsc : Imputed ADL Calibrated to Surrogate

n  missing  unique  Info  Mean  .05  .10  .25  .50  .75  .90  .95 

537  0  144  0.96  2.119  0.000  0.000  0.000  1.839  3.375  6.000  6.000 

lowest  :  0.0000  0.4948  0.4948  1.0000  1.1667 
highest:  5.7832  6.0000  6.3398  6.4658  7.0000 


Next,  patterns  of  missing  data  are  displayed. 

plot(naclus(support[acute,]))    # Figure 19.1

The Hmisc varclus function is used to quantify and depict associations between
predictors, allowing for general nonmonotonic relationships. This is done by
using Hoeffding's D as a similarity measure for all possible pairs of predictors
instead of the default similarity, Spearman's ρ.


ac <- support[acute,]
ac$dzgroup <- ac$dzgroup[drop=TRUE]    # Remove unused levels
label(ac$dzgroup) <- 'Disease Group'
attach(ac)
vc <- varclus(~ age + sex + dzgroup + num.co + edu + income +
              scoma + race + meanbp + wblc + hrt + resp +
              temp + pafi + alb + bili + crea + sod + ph +
              glucose + bun + urine + adlsc, sim='hoeffding')
plot(vc)    # Figure 19.2


19.2  Checking  Adequacy  of  Log-Normal  Accelerated 
Failure  Time  Model 

Let us check whether a parametric survival time model will fit the data, with
respect to the key prognostic factors. First, Kaplan-Meier estimates stratified
by disease group are computed, and plotted after inverse normal transformation,
against log t. Parallelism and linearity indicate goodness of fit to the
log-normal distribution for disease group. Then a more stringent assessment
is made by fitting an initial model and computing right-censored residuals.
These residuals, after dividing by σ̂, should all have a normal distribution
if the model holds. We compute Kaplan-Meier estimates of the distribution
of the residuals and overlay the estimated survival distribution with the
theoretical Gaussian one. This is done overall, and then to get more stringent
assessments of fit, residuals are stratified by key predictors and plots are
produced that contain multiple Kaplan-Meier curves along with a single
theoretical normal curve. All curves should hover about the normal distribution.
To gauge the natural variability of stratified residual distribution estimates,
the residuals are also stratified by a random number that has no bearing on
the goodness of fit.
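
The first graphical check rests on a simple identity. Under a log-normal model S(t) = 1 - Φ((log t - μ)/σ), so Φ⁻¹(S(t)) = -(log t - μ)/σ is linear in log t with slope -1/σ, and groups that share σ give parallel lines. A short base R check follows; the values of mu and sigma are made up for illustration.

```r
mu    <- c(4, 5)     # hypothetical log-scale locations for two groups
sigma <- 2           # hypothetical common scale parameter
t     <- c(1, 10, 100, 1000)

# Inverse-normal transform of the true survival function for each group
z1 <- qnorm(plnorm(t, meanlog = mu[1], sdlog = sigma, lower.tail = FALSE))
z2 <- qnorm(plnorm(t, meanlog = mu[2], sdlog = sigma, lower.tail = FALSE))

slope1 <- unname(coef(lm(z1 ~ log(t)))[2])   # both slopes equal -1/sigma
slope2 <- unname(coef(lm(z2 ~ log(t)))[2])
print(c(slope1, slope2, -1 / sigma))
print(z2 - z1)                               # constant shift: parallel lines
```

With real data the true survival function is replaced by its Kaplan-Meier estimate, so the lines are only approximately straight and parallel.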

dd <- datadist(ac)
# describe distributions of variables to rms

Fig. 19.1 Cluster analysis showing which predictors tend to be missing on the same
patients

Fig. 19.2 Hierarchical clustering of potential predictors using Hoeffding D as a
similarity measure. Categorical predictors are automatically expanded into dummy
variables.

# Inverse-normal-transformed Kaplan-Meier estimates, stratified by dzgroup
survplot(npsurv(S ~ dzgroup), fun=qnorm, logt=TRUE)    # Figure 19.3
f <- psm(S ~ dzgroup + rcs(age,5) + rcs(meanbp,5),
         dist='lognormal', y=TRUE)
r <- resid(f)
survplot(r, dzgroup, label.curve=FALSE)
survplot(r, age,     label.curve=FALSE)
survplot(r, meanbp,  label.curve=FALSE)
random <- runif(length(age)); label(random) <- 'Random Number'
survplot(r, random,  label.curve=FALSE)    # Fig. 19.4

Fig. 19.3 Φ⁻¹(S_KM(t)) stratified by dzgroup. Linearity and semi-parallelism
indicate a reasonable fit to the log-normal accelerated failure time model with
respect to one predictor.

Now remove from consideration predictors that are missing in more than 0.2
of patients. Many of these were collected only for the second half of
SUPPORT. Of those variables to be included in the model, find which ones have
enough potential predictive power to justify allowing for nonlinear
relationships or multiple categories, which spend more d.f. For each variable
compute Spearman ρ² based on multiple linear regression of rank(x), rank(x)²,
and the survival time, truncating survival time at the shortest follow-up for
survivors (356 days; see Section 4.1).
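
The 0.2 rule can be applied directly to the missing-value counts reported in the descriptive statistics of Section 19.1 (537 patients). The base R sketch below only restates that arithmetic for the candidate predictors that have any missing values; n_missing is a hand-copied vector, not output produced by the text.

```r
n <- 537
# Missing counts copied from the describe output in Section 19.1
n_missing <- c(edu = 126, income = 202, race = 2, wblc = 5, pafi = 37,
               alb = 191, bili = 151, ph = 37, glucose = 240, bun = 233,
               urine = 234)
frac    <- n_missing / n
dropped <- names(frac)[frac > 0.2]   # missing in more than 0.2 of patients
print(round(frac, 3))
print(dropped)
```

The predictors flagged here are the ones absent from the model formulas that follow.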

shortest.follow.up <- min(d.time[death==0], na.rm=TRUE)
d.timet <- pmin(d.time, shortest.follow.up)

w <- spearman2(d.timet ~ age + num.co + scoma + meanbp +
               hrt + resp + temp + crea + sod + adlsc +
               wblc + pafi + ph + dzgroup + race, p=2)
plot(w, main='')    # Figure 19.5
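
To see why the quadratic rank term matters, the base R sketch below compares ordinary Spearman ρ² with the R² from regressing rank(y) on rank(x) and rank(x)², which is the idea behind the p=2 option described above. It illustrates the idea on made-up data and is not a re-implementation of the Hmisc function; the U-shaped relationship is chosen so that a purely monotonic measure misses it.

```r
x <- -100:100
y <- x^2                               # strong but non-monotonic relationship

rx <- rank(x); ry <- rank(y)
rho2_linear <- cor(rx, ry)^2           # ordinary Spearman rho squared
rho2_quad   <- summary(lm(ry ~ rx + I(rx^2)))$r.squared   # allows one turn

print(c(linear = rho2_linear, quadratic = rho2_quad))
```

A predictor with a U-shaped effect would look useless under the linear rank measure yet clearly deserves nonlinear terms.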


A better approach is to use the complete information in the failure and
censoring times by computing Somers' Dxy rank correlation allowing for censoring.


w <- rcorrcens(S ~ age + num.co + scoma + meanbp + hrt + resp +
               temp + crea + sod + adlsc + wblc + pafi + ph +
               dzgroup + race)
plot(w, main='')    # Figure 19.6
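
Somers' Dxy with censoring can be sketched in a few lines of base R. The hypothetical function below is a simplified illustration, not the rcorrcens algorithm: a pair of patients is usable only if the shorter of the two follow-up times is an observed failure, pairs with tied times are skipped, and Dxy = 2(C - 0.5), where C is the proportion of usable pairs in which the patient with the larger predictor value lived longer (ties in the predictor count one half).

```r
# Hypothetical helper: concordance of predictor x with right-censored time.
# event = 1 for an observed failure, 0 for a censored time.
dxy_censored <- function(x, time, event) {
  conc <- disc <- tied <- 0
  n <- length(x)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next             # simplification: skip tied times
      a <- if (time[i] < time[j]) i else j     # subject with the shorter time
      b <- if (a == i) j else i
      if (event[a] != 1) next                  # shorter time censored: unusable
      if (x[a] < x[b]) {
        conc <- conc + 1                       # larger x lived longer
      } else if (x[a] > x[b]) {
        disc <- disc + 1
      } else {
        tied <- tied + 1
      }
    }
  }
  usable <- conc + disc + tied
  C <- (conc + tied / 2) / usable
  c(C = C, Dxy = 2 * (C - 0.5), usable = usable)
}

time  <- c(1, 2, 3, 4)
event <- c(0, 1, 1, 0)
res_pos <- dxy_censored(x = c(10, 20, 30, 40), time, event)
res_neg <- dxy_censored(x = c(40, 30, 20, 10), time, event)
print(rbind(res_pos, res_neg))
```

In this toy data the first patient is censored earliest, so the three pairs involving that patient contribute nothing.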

Remaining  missing  values  are  imputed  using  the  “most  normal”  values,  a 
procedure  found  to  work  adequately  for  this  particular  study.  Race  is  imputed 
using  the  modal  category. 

# Compute number of missing values per variable
sapply(llist(age,num.co,scoma,meanbp,hrt,resp,temp,crea,sod,
             adlsc,wblc,pafi,ph), function(x) sum(is.na(x)))

   age num.co  scoma meanbp    hrt   resp   temp   crea    sod  adlsc
     0      0      0      0      0      0      0      0      0      0
  wblc   pafi     ph
     5     37     37
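
Single imputation with a fixed "most normal" value, and modal-category imputation for a factor, amount to the following in base R. The helper names and the constant used below (9 for a white blood cell count) are illustrative assumptions that are not stated in the text above, and the toy vectors merely stand in for the real variables.

```r
# Hypothetical helpers for constant and modal-category imputation
impute_const <- function(x, value) { x[is.na(x)] <- value; x }
impute_mode  <- function(f) {
  f <- factor(f)
  f[is.na(f)] <- names(which.max(table(f)))   # most frequent observed level
  f
}

wblc_toy <- c(12.4, NA, 7.9, 18.2, NA)
race_toy <- factor(c("white", "black", NA, "white", "white"))

wblc_i <- impute_const(wblc_toy, 9)           # 9 is an assumed "normal" value
race_i <- impute_mode(race_toy)
print(wblc_i)
print(race_i)
```

Such constant fills ignore imputation uncertainty, which is why the text presents the procedure as adequate for this particular study rather than as a general recommendation.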


[Figure 19.4: four panels titled Disease Group, Age, Mean Arterial Blood
Pressure Day 3, and Random Number, each with x-axis Residual]

Fig. 19.4 Kaplan-Meier estimates of distributions of normalized, right-censored
residuals from the fitted log-normal survival model. Residuals are stratified by
important variables in the model (by quartiles of continuous variables), plus a
random variable to depict the natural variability (in the lower right plot).
Theoretical standard Gaussian distributions of residuals are shown with a thick
solid line.

[Figure 19.5: dot chart of adjusted ρ² (x-axis, 0.00 to 0.12) for each
predictor, annotated with N and d.f.]

Fig. 19.5 Generalized Spearman ρ² rank correlation between predictors and
truncated survival time

Fig. 19.6 Somers' Dxy rank correlation between predictors and original survival
time. For dzgroup or race, the correlation coefficient is the maximum correlation from
using a dummy variable to represent the most frequent or one to represent the second
most frequent category.

Now that missing values have been imputed, a formal multivariable redundancy
analysis can be undertaken. The Hmisc package's redun function goes
farther than the varclus pairwise correlation approach and allows for
nonmonotonic transformations in predicting each predictor from all the others.

redun(~ crea + age + sex + dzgroup + num.co + scoma + adlsc +
        race2 + meanbp + hrt + resp + temp + sod + wblc.i +
        pafi.i + ph.i, nk=4)


Redundancy Analysis

redun(formula = ~crea + age + sex + dzgroup + num.co + scoma +
    adlsc + race2 + meanbp + hrt + resp + temp + sod + wblc.i +
    pafi.i + ph.i, nk = 4)

n: 537   p: 16   nk: 4

Number of NAs:  0

Transformation of target variables forced to be linear

R² cutoff: 0.9   Type: ordinary

R² with which each variable can be predicted from all other variables:

   crea     age     sex dzgroup  num.co   scoma   adlsc   race2  meanbp
  0.133   0.246   0.132   0.451   0.147   0.418   0.153   0.151   0.178
    hrt    resp    temp     sod  wblc.i  pafi.i    ph.i
  0.258   0.131   0.197   0.135   0.093   0.143   0.171

No redundant variables
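
The idea behind the redundancy analysis can be illustrated with base R: predict each variable from all the others and flag any whose R² exceeds the cutoff (0.9 in the output above). This sketch uses plain linear regression on simulated data, so unlike redun it does not expand the predictors with splines; the data frame d and its columns x1 to x4 are made up.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$x4 <- d$x1 + d$x2 + rnorm(n, sd = 0.05)  # nearly a linear function of x1, x2

# R^2 from predicting each variable from all of the others
r2 <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  summary(lm(reformulate(others, response = v), data = d))$r.squared
})
print(round(r2, 3))
print(names(r2)[r2 > 0.9])                 # highly predictable variables
```

Here x1, x2, and x4 are mutually dependent, so each is highly predictable from the rest, while the independent x3 is not.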

Now turn to a more efficient approach for gauging the potential of each
predictor, one that makes maximal use of failure time and censored data:
allow all continuous variables to have a maximum number of knots in a
log-normal survival model. This approach must use imputation to have an
adequate sample size. A semi-saturated main effects additive log-normal model
is fitted. It is necessary to limit restricted cubic splines to 4 knots, to
force scoma to be linear, and to omit ph.i in order to avoid a singular
covariance matrix in the fit.

k <- 4
f <- psm(S ~ rcs(age,k) + sex + dzgroup + pol(num.co,2) + scoma +
         pol(adlsc,2) + race + rcs(meanbp,k) + rcs(hrt,k) +
         rcs(resp,k) + rcs(temp,k) + rcs(crea,3) + rcs(sod,k) +
         rcs(wblc.i,k) + rcs(pafi.i,k), dist='lognormal')
plot(anova(f))    # Figure 19.7

Figure 19.7 properly blinds the analyst to the form of effects (tests of
linearity). Next fit a log-normal survival model with number of parameters
corresponding to nonlinear effects determined from the partial χ² tests in
Figure 19.7. For the most promising predictors, five knots can be allocated,
as there are fewer singularity problems once less promising predictors are
simplified.


[Figure 19.7: dot chart with x-axis χ² - df; rows as listed: sex, temp, race,
sod, num.co, hrt, wblc.i, adlsc, resp, scoma, pafi.i, age, meanbp, crea,
dzgroup]

Fig. 19.7 Partial χ² statistics for association of each predictor with response from
saturated main effects model, penalized for d.f.


f <- psm(S ~ rcs(age,5) + sex + dzgroup + num.co +
         scoma + pol(adlsc,2) + race2 + rcs(meanbp,5) +
         rcs(hrt,3) + rcs(resp,3) + temp +
         rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
         dist='lognormal')
print(f, latex=TRUE, coefs=FALSE)


Parametric Survival Model: Log Normal Distribution

psm(formula = S ~ rcs(age, 5) + sex + dzgroup + num.co + scoma +
    pol(adlsc, 2) + race2 + rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp,
    3) + temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) + rcs(pafi.i,
    4), dist = "lognormal")

                    Model Likelihood        Discrimination
                       Ratio Test              Indexes
Obs          537    LR χ²       236.83      R²      0.594
Events       356    d.f.            30      Dxy     0.485
σ       2.230782    Pr(> χ²)  < 0.0001      g       0.033
                                            gr      1.959

a <- anova(f)


Table 19.1 Wald Statistics for S

                       χ²    d.f.          P
age                 15.99       4     0.0030
  Nonlinear          0.23       3     0.9722
sex                  0.11       1     0.7354
dzgroup             45.69       2   < 0.0001
num.co               4.99       1     0.0255
scoma               10.58       1     0.0011
adlsc                8.28       2     0.0159
  Nonlinear          3.31       1     0.0691
race2                1.26       1     0.2624
meanbp              27.62       4   < 0.0001
  Nonlinear         10.51       3     0.0147
hrt                 11.83       2     0.0027
  Nonlinear          1.04       1     0.3090
resp                11.10       2     0.0039
  Nonlinear          8.56       1     0.0034
temp                 0.39       1     0.5308
crea                33.63       3   < 0.0001
  Nonlinear         21.27       2   < 0.0001
sod                  0.08       1     0.7792
wblc.i               5.47       2     0.0649
  Nonlinear          5.46       1     0.0195
pafi.i              15.37       3     0.0015
  Nonlinear          6.97       2     0.0307
TOTAL NONLINEAR     60.48      14   < 0.0001
TOTAL              261.47      30   < 0.0001


19.3 Summarizing the Fitted Model

First let's plot the shape of the effect of each predictor on log survival time.
All effects are centered so that they can be placed on a common scale. This
allows the relative strength of various predictors to be judged. Then Wald
χ² statistics, penalized for d.f., are plotted in descending order. Next,
relative effects of varying predictors over reasonable ranges (survival time
ratios varying continuous predictors from the first to the third quartile) are
charted.


ggplot(Predict(f, ref.zero=TRUE), vnames='names',
       sepdiscrete='vertical', anova=a)           # Figure 19.8
latex(a, file='', label='tab:support-anovat')     # Table 19.1
plot(a)                                           # Figure 19.9
options(digits=3)
plot(summary(f), log=TRUE, main='')               # Figure 19.10

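
The penalized statistics that are plotted in descending order can be reproduced from Table 19.1 by subtracting each predictor's d.f. from its overall Wald χ². The base R snippet below simply redoes that arithmetic with values copied from the table.

```r
# Overall Wald chi-square and d.f. for each predictor, copied from Table 19.1
chisq <- c(age = 15.99, sex = 0.11, dzgroup = 45.69, num.co = 4.99,
           scoma = 10.58, adlsc = 8.28, race2 = 1.26, meanbp = 27.62,
           hrt = 11.83, resp = 11.10, temp = 0.39, crea = 33.63,
           sod = 0.08, wblc.i = 5.47, pafi.i = 15.37)
df    <- c(age = 4, sex = 1, dzgroup = 2, num.co = 1, scoma = 1, adlsc = 2,
           race2 = 1, meanbp = 4, hrt = 2, resp = 2, temp = 1, crea = 3,
           sod = 1, wblc.i = 2, pafi.i = 3)
penalized <- sort(chisq - df, decreasing = TRUE)
print(round(penalized, 2))
```

Subtracting d.f. puts predictors that were given many parameters on a fairer footing with those given one, since χ² has expectation equal to its d.f. when there is no association.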

19.4  Internal  Validation  of  the  Fitted  Model 
Using  the  Bootstrap 

Let  us  decide  whether  there  was  significant  overfitting  during  the  development 
of  this  model,  using  the  bootstrap. 

# First add data to model fit so bootstrap can re-sample
# from the data
g <- update(f, x=TRUE, y=TRUE)
set.seed(717)
latex(validate(g, B=300), digits=2, size='Ssize')


Index       Original   Training   Test     Optimism   Corrected     n
            Sample     Sample     Sample              Index
Dxy           0.49       0.51      0.46      0.05        0.43      300
R²            0.59       0.66      0.54      0.12        0.47      300
Intercept     0.00       0.00     -0.05      0.05       -0.05      300
Slope         1.00       1.00      0.90      0.10        0.90      300
D             0.48       0.55      0.42      0.13        0.35      300
U             0.00       0.00     -0.01      0.01       -0.01      300
Q             0.48       0.56      0.43      0.12        0.36      300
g             1.96       2.05      1.87      0.19        1.77      300

Fig. 19.8 Effect of each predictor on log survival time. Predicted values have been
centered so that predictions at predictor reference values are zero. Pointwise 0.95
confidence bands are also shown. As all y-axes have the same scale, it is easy to see
which predictors are strongest.


Judging from Dxy and R² there is a moderate amount of overfitting. The
slope shrinkage factor (0.9) is not troublesome, however. An almost unbiased
estimate of future predictive discrimination on similar patients is given by
the corrected Dxy of 0.43. This index equals the difference between the
probability of concordance and the probability of discordance of pairs of
predicted survival times and pairs of observed survival times, accounting for
censoring.
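
The logic behind the Optimism and Corrected Index columns can be sketched with an ordinary linear model in base R. This illustrates the optimism bootstrap under simplifying assumptions (simulated data, R² as the index, lm in place of psm, B = 200); it is not a description of what validate runs internally.

```r
set.seed(2)
n <- 80; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                       # only the first column matters
d <- data.frame(y = y, X)

r2 <- function(fit, data) {                  # R^2 of a fit evaluated on 'data'
  1 - sum((data$y - predict(fit, data))^2) / sum((data$y - mean(data$y))^2)
}

apparent <- r2(lm(y ~ ., data = d), d)       # "Original Sample" index
B <- 200
optimism <- mean(replicate(B, {
  boot <- d[sample(n, replace = TRUE), ]     # training sample
  fit  <- lm(y ~ ., data = boot)
  r2(fit, boot) - r2(fit, d)                 # training index minus test index
}))
corrected <- apparent - optimism
print(round(c(apparent = apparent, optimism = optimism,
              corrected = corrected), 3))
```

Each bootstrap model is evaluated both on its own training sample and on the original data, and the average gap is subtracted from the apparent index.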

Next, a bootstrap overfitting-corrected calibration curve is estimated.
Patients are stratified by the predicted probability of surviving one year,
such that there are at least 60 patients in each group.

Fig. 19.9 Contribution of variables in predicting survival time in log-normal model


[Figure 19.10: bars of estimated survival time ratios on a log-scaled axis
(ticks at 0.10, 0.50, 1.00, 2.00, 3.50) for the following contrasts]
age - 74.5:47.9
num.co - 2:1
scoma - 37:0
adlsc - 3.38:0
meanbp - 111:59
hrt - 126:75
resp - 32:12
temp - 38.5:36.4
crea - 2.6:0.9
sod - 142:134
wblc.i - 18.2:8.1
pafi.i - 323:142
sex - female:male
dzgroup - Coma:ARF/MOSF w/Sepsis
dzgroup - MOSF w/Malig:ARF/MOSF w/Sepsis
race2 - other:white

Fig. 19.10 Estimated survival time ratios for default settings of predictors. For
example, when age changes from its lower quartile to the upper quartile (47.9y to
74.5y), median survival time decreases by more than half. Different shaded areas of
bars indicate different confidence levels (0.9, 0.95, 0.99).



set.seed(717)
cal <- calibrate(g, u=1, B=300)
plot(cal, subtitles=FALSE)
cal <- calibrate(g, cmethod='KM', u=1, m=60, B=120, pr=FALSE)
plot(cal, add=TRUE)    # Figure 19.11

Fig. 19.11 Bootstrap validation of calibration curve. Dots represent apparent
calibration accuracy; × are bootstrap estimates corrected for overfitting, based on
binning predicted survival probabilities and computing Kaplan-Meier estimates. Black
curve is the estimated observed relationship using hare and the blue curve is the
overfitting-corrected hare estimate. The gray-scale line depicts the ideal relationship.


19.5  Approximating  the  Full  Model 

The fitted log-normal model is perhaps too complex for routine use and for
routine data collection. Let us develop a simplified model that can predict
the predicted values of the full model with high accuracy ($R^2 = 0.957$). The
simplification is done using a fast backward step-down against the full model
predicted values.

Z <- predict(f)    # X*beta hat
a <- ols(Z ~ rcs(age,5) + sex + dzgroup + num.co +
           scoma + pol(adlsc,2) + race2 +
           rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) +
           temp + rcs(crea,4) + sod + rcs(wblc.i,3) +
           rcs(pafi.i,4), sigma=1)
# sigma=1 is used to prevent sigma hat from being zero when
# R2=1.0 since we start out by approximating Z with all
# component variables
fastbw(a, aics=10000)    # fast backward stepdown


Deleted   Chi-Sq  d.f.  P      Residual  d.f.  P       AIC      R2
sod         0.43   1    0.512     0.43     1   0.5117    -1.57  1.000
sex         0.57   1    0.451     1.00     2   0.6073    -3.00  0.999
temp        2.20   1    0.138     3.20     3   0.3621    -2.80  0.998
race2       6.81   1    0.009    10.01     4   0.0402     2.01  0.994
wblc.i     29.52   2    0.000    39.53     6   0.0000    27.53  0.976
num.co     30.84   1    0.000    70.36     7   0.0000    56.36  0.957
resp       54.18   2    0.000   124.55     9   0.0000   106.55  0.924
adlsc      52.46   2    0.000   177.00    11   0.0000   155.00  0.892
pafi.i     66.78   3    0.000   243.79    14   0.0000   215.79  0.851
scoma      78.07   1    0.000   321.86    15   0.0000   291.86  0.803
hrt        83.17   2    0.000   405.02    17   0.0000   371.02  0.752
age        68.08   4    0.000   473.10    21   0.0000   431.10  0.710
crea      314.47   3    0.000   787.57    24   0.0000   739.57  0.517
meanbp    403.04   4    0.000  1190.61    28   0.0000  1134.61  0.270
dzgroup   441.28   2    0.000  1631.89    30   0.0000  1571.89  0.000

Approximate Estimates after Deleting Factors

       Coef     S.E.     Wald Z  P
[1,]  -0.5928   0.04315  -13.74  0

Factors in Final Model

None


f.approx <- ols(Z ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) +
                  rcs(age,5) + rcs(hrt,3) + scoma +
                  rcs(pafi.i,4) + pol(adlsc,2) +
                  rcs(resp,3), x=TRUE)
f.approx$stats


      n  Model L.R.    d.f.      R2       g   Sigma
537.000    1688.225  23.000   0.957   1.915   0.370


We  can  estimate  the  variance-covariance  matrix  of  the  coefficients  of  the 
reduced  model  using  Equation  5.2  in  Section  5.5.2.  The  computations  below 
result  in  a  covariance  matrix  that  does  not  include  elements  related  to  the 
scale  parameter.  In  the  code  x  is  the  matrix  T  in  Section  5.5.2. 


V <- vcov(f, regcoef.only=TRUE)       # var(full model)
X <- cbind(Intercept=1, g$x)          # full model design
x <- cbind(Intercept=1, f.approx$x)   # approx. model design
w <- solve(t(x) %*% x, t(x)) %*% X    # contrast matrix
v <- w %*% V %*% t(w)
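
The role of the contrast matrix can be checked on simulated data. The sketch below uses base R only; the design matrix, the coefficient vector `beta`, and the covariance matrix `V` are made-up stand-ins for the full-model quantities, not values from the SUPPORT analysis. It shows that multiplying the full-model coefficients by `w` reproduces the least squares fit of the full-model linear predictor on the reduced design.

```r
set.seed(1)
n <- 200
# Hypothetical full design with three predictors
X <- cbind(Intercept = 1, x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
beta <- c(0.5, 1, -0.7, 0.2)               # stand-in for full-model estimates
V    <- diag(c(0.04, 0.02, 0.03, 0.05))    # stand-in for their covariance matrix

Z <- drop(X %*% beta)                      # full-model linear predictor
x <- X[, c("Intercept", "x1", "x2")]       # reduced design (x3 dropped)

w        <- solve(t(x) %*% x, t(x)) %*% X  # contrast matrix
b.approx <- drop(w %*% beta)               # approximate-model coefficients
v        <- w %*% V %*% t(w)               # their covariance matrix

# Same coefficients as regressing Z on the reduced design by least squares
fit <- lm.fit(x, Z)
all.equal(unname(b.approx), unname(fit$coefficients))
```

Because the reduced coefficients are a fixed linear transformation of the full-model coefficients, their covariance matrix follows directly from the full-model covariance matrix without refitting against the outcome.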

Let’s  compare  the  variance  estimates  (diagonals  of  v)  with  variance  estimates 
from  a  reduced  model  that  is  fitted  against  the  actual  outcomes. 

f.sub <- psm(S ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) +
               rcs(age,5) + rcs(hrt,3) + scoma + rcs(pafi.i,4) +
               pol(adlsc,2) + rcs(resp,3), dist='lognormal')
diag(v) / diag(vcov(f.sub, regcoef.only=TRUE))


   Intercept   dzgroup=Coma   dzgroup=MOSF w/Malig
       0.981          0.979                  0.979
      meanbp        meanbp'               meanbp''
       0.977          0.979                  0.979
   meanbp'''           crea                  crea'
       0.979          0.979                  0.979
      crea''            age                   age'
       0.979          0.982                  0.981
       age''         age'''                    hrt
       0.981          0.980                  0.978
        hrt'          scoma                 pafi.i
       0.976          0.979                  0.980
     pafi.i'       pafi.i''                  adlsc
       0.980          0.980                  0.981
     adlsc^2           resp                  resp'
       0.981          0.978                  0.977

r <- diag(v) / diag(vcov(f.sub, regcoef.only=TRUE))
r[c(which.min(r), which.max(r))]


 hrt'    age
0.976  0.982


The  estimated  variances  from  the  reduced  model  are  actually  slightly  smaller 
than  those  that  would  have  been  obtained  from  stepwise  variable  selection 
in  this  case,  had  variable  selection  used  a  stopping  rule  that  resulted  in  the 
same  set  of  variables  being  selected.  Now  let  us  compute  Wald  statistics  for 
the  reduced  model. 

f.approx$var <- v
latex(anova(f.approx, test='Chisq', ss=FALSE), file='',
      label='tab:support-anovaa')

The  results  are  shown  in  Table  19.2.  Note  the  similarity  of  the  statistics 
to  those  found  in  the  table  for  the  full  model.  This  would  not  be  the  case  had 
deleted  variables  been  very  collinear  with  retained  variables. 

The  equation  for  the  simplified  model  follows.  The  model  is  also  depicted 
graphically  in  Figure  19.12.  The  nomogram  allows  one  to  calculate  mean  and 
median  survival  time.  Survival  probabilities  could  have  easily  been  added  as 
additional  axes. 

# Typeset mathematical form of approximate model
latex(f.approx, file='')


$E(Z) = X\beta$, where

$$
\begin{aligned}
X\hat{\beta} ={}& -2.51 \\
& -1.94[\text{Coma}] - 1.75[\text{MOSF w/Malig}] \\
& +0.068\,\text{meanbp} - 3.08\times10^{-5}(\text{meanbp}-41.8)_+^3 + 7.9\times10^{-5}(\text{meanbp}-61)_+^3 \\
& -4.91\times10^{-5}(\text{meanbp}-73)_+^3 + 2.61\times10^{-6}(\text{meanbp}-109)_+^3 - 1.7\times10^{-6}(\text{meanbp}-135)_+^3 \\
& -0.553\,\text{crea} - 0.229(\text{crea}-0.6)_+^3 + 0.45(\text{crea}-1.1)_+^3 - 0.233(\text{crea}-1.94)_+^3 \\
& +0.0131(\text{crea}-7.32)_+^3 \\
& -0.0165\,\text{age} - 1.13\times10^{-5}(\text{age}-28.5)_+^3 + 4.05\times10^{-5}(\text{age}-49.5)_+^3 \\
& -2.15\times10^{-5}(\text{age}-63.7)_+^3 - 2.68\times10^{-5}(\text{age}-72.7)_+^3 + 1.9\times10^{-5}(\text{age}-85.6)_+^3 \\
& -0.0136\,\text{hrt} + 6.09\times10^{-7}(\text{hrt}-60)_+^3 - 1.68\times10^{-6}(\text{hrt}-111)_+^3 + 1.07\times10^{-6}(\text{hrt}-140)_+^3 \\
& -0.0135\,\text{scoma} \\
& +0.0161\,\text{pafi.i} - 4.77\times10^{-7}(\text{pafi.i}-88)_+^3 + 9.11\times10^{-7}(\text{pafi.i}-167)_+^3 \\
& -5.02\times10^{-7}(\text{pafi.i}-276)_+^3 + 6.76\times10^{-8}(\text{pafi.i}-426)_+^3 - 0.369\,\text{adlsc} + 0.0409\,\text{adlsc}^2 \\
& +0.0394\,\text{resp} - 9.11\times10^{-5}(\text{resp}-10)_+^3 + 0.000176(\text{resp}-24)_+^3 - 8.5\times10^{-5}(\text{resp}-39)_+^3
\end{aligned}
$$

and $[c] = 1$ if subject is in group $c$, 0 otherwise; $(x)_+ = x$ if $x > 0$, 0 otherwise.


Table 19.2 Wald Statistics for Z

                   χ2      d.f.   P
dzgroup            55.94    2    < 0.0001
meanbp             29.87    4    < 0.0001
  Nonlinear         9.84    3      0.0200
crea               39.04    3    < 0.0001
  Nonlinear        24.37    2    < 0.0001
age                18.12    4      0.0012
  Nonlinear         0.34    3      0.9517
hrt                 9.87    2      0.0072
  Nonlinear         0.40    1      0.5289
scoma               9.85    1      0.0017
pafi.i             14.01    3      0.0029
  Nonlinear         6.66    2      0.0357
adlsc               9.71    2      0.0078
  Nonlinear         2.87    1      0.0904
resp                9.65    2      0.0080
  Nonlinear         7.13    1      0.0076
TOTAL NONLINEAR    58.08   13    < 0.0001
TOTAL             252.32   23    < 0.0001

# Derive S functions that express mean and quantiles
# of survival time for specific linear predictors
# analytically
expected.surv <- Mean(f)
quantile.surv <- Quantile(f)
latex(expected.surv, file='', type='Sinput')


expected.surv <- function (lp = NULL,
    parms = 0.802352037606488)
{
    names(parms) <- NULL
    exp(lp + exp(2 * parms)/2)
}

latex(quantile.surv, file='', type='Sinput')


quantile.surv <- function (q = 0.5, lp = NULL,
    parms = 0.802352037606488)
{
    names(parms) <- NULL
    f <- function(lp, q, parms) lp + exp(parms) * qnorm(q)
    names(q) <- format(q)
    drop(exp(outer(lp, q, FUN = f, parms = parms)))
}


median.surv <- function(x) quantile.surv(lp=x)

# Improve variable labels for the nomogram
f.approx <- Newlabels(f.approx, c('Disease Group', 'Mean Arterial BP',
              'Creatinine', 'Age', 'Heart Rate', 'SUPPORT Coma Score',
              'PaO2/(.01*FiO2)', 'ADL', 'Resp. Rate'))
nom <-
  nomogram(f.approx,
           pafi.i=c(0, 50, 100, 200, 300, 500, 600, 700, 800, 900),
           fun=list('Median Survival Time'=median.surv,
                    'Mean Survival Time'  =expected.surv),
           fun.at=c(.1, .25, .5, 1, 2, 5, 10, 20, 40))
plot(nom, cex.var=1, cex.axis=.75, lmgp=.25)
# Figure 19.12
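
The two derived functions are the usual log-normal mean and quantile formulas, with `parms` holding the log of the scale parameter. A small base R check is sketched below; the linear predictor values in `lp` are hypothetical, and the object names are illustrative only.

```r
parms <- 0.802352037606488            # log(sigma), as printed in the functions above
sigma <- exp(parms)
lp    <- c(-1, 0, 1.5)                # hypothetical linear predictor values

mean.lnorm   <- exp(lp + sigma^2 / 2)          # equals exp(lp + exp(2 * parms)/2)
median.lnorm <- exp(lp + sigma * qnorm(0.5))   # qnorm(0.5) is 0, so this is exp(lp)

# Agreement with base R's log-normal quantile function
all.equal(median.lnorm, qlnorm(0.5, meanlog = lp, sdlog = sigma))

# The mean exceeds the median by the constant factor exp(sigma^2 / 2)
mean.lnorm / median.lnorm
```

This constant ratio is why the mean and median axes of the nomogram are shifted versions of one another on the log scale.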

19.6  Problems 

Analyze  the  Mayo  Clinic  PBC  dataset. 

1. Graphically assess whether Weibull (extreme value), exponential,
log-logistic, or log-normal distributions will fit the data, using a few apparently
important stratification factors.

2. For the best fitting parametric model from among the four examined,
fit a model containing several sensible covariables, both categorical and
continuous. Do a Wald test for whether each factor in the model has an
association with survival time, and a likelihood ratio test for the
simultaneous contribution of all predictors. For classification factors having more
than two levels, be sure that the Wald test has the appropriate degrees
of freedom. For continuous factors, verify or relax linearity assumptions.
If using a Weibull model, test whether a simpler exponential model would
be appropriate. Interpret all estimated coefficients in the model. Write the
full survival model in mathematical form. Generate a predicted survival
curve for a patient with a given set of characteristics.

See [361] for an analysis of this dataset using linear splines in time and in the
covariables.

Fig. 19.12 Nomogram for predicting median and mean survival time, based on
approximation of full model


Chapter  20 

Cox Proportional Hazards Regression Model


20.1  Model 


20.1.1  Preliminaries 


The Cox proportional hazards model is the most popular model for the
analysis of survival data. It is a semiparametric model; it makes a parametric
assumption concerning the effect of the predictors on the hazard function,
but makes no assumption regarding the nature of the hazard function $\lambda(t)$
itself. The Cox PH model assumes that predictors act multiplicatively on the
hazard function but does not assume that the hazard function is constant (i.e.,
exponential model), Weibull, or any other particular form. The regression
portion of the model is fully parametric; that is, the regressors are linearly
related to log hazard or log cumulative hazard. In many situations, either
the form of the true hazard function is unknown or it is complex, so the
Cox model has definite advantages. Also, one is usually more interested in
the effects of the predictors than in the shape of $\lambda(t)$, and the Cox approach
allows the analyst to essentially ignore $\lambda(t)$, which is often not of primary
interest.

The Cox PH model uses only the rank ordering of the failure and censoring
times and thus is less affected by outliers in the failure times than fully
parametric methods. The model contains as a special case the popular
log-rank test for comparing survival of two groups. For estimating and testing
regression coefficients, the Cox model is as efficient as parametric models
(e.g., Weibull model with PH) even when all assumptions of the parametric
model are satisfied [171].

When a parametric model's assumptions are not true (e.g., when a Weibull
model is used and the population is not from a Weibull survival distribution
so that the choice of model is incorrect), the Cox analysis is more efficient
than the parametric analysis. As shown below, diagnostics for checking Cox
model assumptions are very well developed.


20.1.2  Model  Definition 

The Cox PH model is most often stated in terms of the hazard function:

$$\lambda(t|X) = \lambda(t)\exp(X\beta). \tag{20.1}$$

We do not include an intercept parameter in $X\beta$ here. Note that this is
identical to the parametric PH model stated earlier. There is an important
difference, however, in that now we do not assume any specific shape for $\lambda(t)$.
For the moment, we are not even interested in estimating $\lambda(t)$. The reason
for this departure from the fully parametric approach is an ingenious
conditional argument by Cox [132]. Cox argued that when the PH model holds,
information about $\lambda(t)$ is not very useful in estimating the parameters of
primary interest, $\beta$. By special conditioning in formulating the log likelihood
function, Cox showed how to derive a valid estimate of $\beta$ that does not require
estimation of $\lambda(t)$, as $\lambda(t)$ dropped out of the new likelihood function. Cox's
derivation focuses on using the information in the data that relates to the
relative hazard function $\exp(X\beta)$.


20.1.3 Estimation of $\beta$


Cox's derivation of an estimator of $\beta$ can be loosely described as follows. Let
$t_1 < t_2 < \ldots < t_k$ represent the unique ordered failure times in the sample of
$n$ subjects; assume for now that there are no tied failure times (tied censoring
times are allowed) so that $k = n$. Consider the set of individuals at risk of
failing an instant before failure time $t_i$. This set of individuals is called the
risk set at time $t_i$, and we use $R_i$ to denote this risk set. $R_i$ is the set of
subjects $j$ such that the subject had not failed or been censored by time $t_i$;
that is, the risk set $R_i$ includes subjects with failure/censoring time $Y_j \geq t_i$.

The conditional probability that individual $i$ is the one that failed at $t_i$,
given that the subjects in the set $R_i$ are at risk of failing, and given further
that exactly one failure occurs at $t_i$, is

$$\text{Prob}\{\text{subject } i \text{ fails at } t_i \mid R_i \text{ and one failure at } t_i\} = \frac{\text{Prob}\{\text{subject } i \text{ fails at } t_i \mid R_i\}}{\text{Prob}\{\text{one failure at } t_i \mid R_i\}} \tag{20.2}$$

using the rules of conditional probability. This conditional probability equals

$$\frac{\lambda(t_i)\exp(X_i\beta)}{\sum_{j \in R_i} \lambda(t_i)\exp(X_j\beta)} = \frac{\exp(X_i\beta)}{\sum_{j \in R_i} \exp(X_j\beta)} = \frac{\exp(X_i\beta)}{\sum_{Y_j \geq t_i} \exp(X_j\beta)} \tag{20.3}$$

independent of $\lambda(t)$. To understand this likelihood, consider a special case
where the predictors have no effect; that is, $\beta = 0$ [93, pp. 48-49]. Then
$\exp(X_i\beta) = \exp(X_j\beta) = 1$ and Prob{subject $i$ is the subject that failed at
$t_i \mid R_i$ and one failure occurred at $t_i$} is $1/n_i$ where $n_i$ is the number of subjects
at risk at time $t_i$.

By arguing that these conditional probabilities are themselves conditionally
independent across the different failure times, a total likelihood can be
computed by multiplying these individual likelihoods over all failure times.
Cox termed this a partial likelihood for $\beta$:

$$L(\beta) = \prod_{Y_i \text{ uncensored}} \frac{\exp(X_i\beta)}{\sum_{Y_j \geq Y_i} \exp(X_j\beta)}. \tag{20.4}$$

The log partial likelihood is

$$\log L(\beta) = \sum_{Y_i \text{ uncensored}} \Big\{ X_i\beta - \log\Big[\sum_{Y_j \geq Y_i} \exp(X_j\beta)\Big] \Big\}. \tag{20.5}$$

Cox and others have shown that this partial log likelihood can be treated as
an ordinary log likelihood to derive valid (partial) MLEs of $\beta$. Note that this
log likelihood is unaffected by the addition of a constant to any or all of the
$X$s. This is consistent with the fact that an intercept term is unnecessary and
cannot be estimated since the Cox model is a model for the relative hazard
and does not directly estimate the underlying hazard $\lambda(t)$.
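
The log partial likelihood can be evaluated directly in a few lines of base R. The sketch below handles a single predictor with no tied failure times; the function name `cox.pll` and the toy data are illustrative assumptions, not part of the rms or survival packages.

```r
# Log partial likelihood (20.5): one predictor, no tied failure times
cox.pll <- function(beta, time, status, x) {
  eta <- x * beta
  sum(sapply(which(status == 1), function(i) {
    at.risk <- time >= time[i]            # risk set: Y_j >= Y_i
    eta[i] - log(sum(exp(eta[at.risk])))
  }))
}

# Hypothetical toy data: 5 subjects, the third is censored (status = 0)
time   <- c(2, 4, 5, 7, 9)
status <- c(1, 1, 0, 1, 1)
x      <- c(1, 0, 1, 0, 1)

# With beta = 0 each term is -log(n_i), n_i = number at risk
cox.pll(0, time, status, x)
-sum(log(c(5, 4, 2, 1)))

# Partial MLE of beta by one-dimensional maximization
optimize(cox.pll, c(-5, 5), time = time, status = status, x = x,
         maximum = TRUE)$maximum
```

Adding a constant to `x` leaves `cox.pll` unchanged, which is the numerical counterpart of the statement that no intercept can be estimated.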

When there are tied failure times in the sample, the true partial log
likelihood function involves permutations so it can be time-consuming to compute.
When the number of ties is not large, Breslow has derived a satisfactory
approximate log likelihood function. The formula given above, when applied
without modification to samples containing ties, actually uses Breslow's
approximation. If there are ties so that $k < n$ and $t_1, \ldots, t_k$ denote the unique
failure times as we originally intended, Breslow's approximation is written as

$$\log L(\beta) = \sum_{i=1}^{k} \Big\{ S_i\beta - d_i \log\Big[\sum_{Y_j \geq t_i} \exp(X_j\beta)\Big] \Big\}, \tag{20.6}$$

where $S_i = \sum_{j \in D_i} X_j$, $D_i$ is the set of indexes $j$ for subjects failing at time
$t_i$, and $d_i$ is the number of failures at $t_i$.

Efron derived another approximation to the true likelihood that is
significantly more accurate than the Breslow approximation and often yields
estimates that are very close to those from the more cumbersome
permutation likelihood [288]:

$$\log L(\beta) = \sum_{i=1}^{k} \Big\{ S_i\beta - \sum_{j=1}^{d_i} \log\Big[\sum_{Y_j \geq t_i} \exp(X_j\beta) - \frac{j-1}{d_i} \sum_{l \in D_i} \exp(X_l\beta)\Big] \Big\}. \tag{20.7}$$

In the special case when all tied failure times are from subjects with
identical $X_i\beta$, the Efron approximation yields the exact (permutation) marginal
likelihood (Therneau, personal communication, 1993).
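
The difference between the two approximations is easiest to see by computing both on a small dataset with ties. The following base R sketch covers a single predictor; `tied.pll` and the toy data are hypothetical and are not functions or data from rms.

```r
# Breslow (20.6) and Efron (20.7) log likelihoods for one predictor
tied.pll <- function(beta, time, status, x, method = c("breslow", "efron")) {
  method <- match.arg(method)
  eta <- x * beta
  ll  <- 0
  for (ti in sort(unique(time[status == 1]))) {
    fail <- status == 1 & time == ti        # D_i: subjects failing at t_i
    d    <- sum(fail)                       # d_i: number of failures at t_i
    risk <- sum(exp(eta[time >= ti]))       # sum over the risk set
    tied <- sum(exp(eta[fail]))             # sum over D_i
    # Efron removes a fraction (j - 1)/d_i of the tied subjects' contribution
    frac <- if (method == "efron") (seq_len(d) - 1) / d else rep(0, d)
    ll   <- ll + sum(eta[fail]) - sum(log(risk - frac * tied))
  }
  ll
}

# Hypothetical data: two failures tied at time 3, one censoring at time 6
time   <- c(3, 3, 5, 6, 6, 8)
status <- c(1, 1, 1, 1, 0, 1)
x      <- c(1, 0, 1, 0, 1, 0)

tied.pll(0, time, status, x, "breslow")   # -(2*log(6) + log(4) + log(3))
tied.pll(0, time, status, x, "efron")     # -(log(6) + log(5) + log(4) + log(3))
```

With $\beta = 0$ the Efron value matches what would be obtained if the two tied subjects had failed one after the other, which illustrates the special case noted above.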

Kalbfleisch and Prentice [330] showed that Cox's partial likelihood, in the
absence of predictors that are functions of time, is a marginal distribution of
the ranks of the failure/censoring times.

See Therneau and Grambsch and Huang and Harrington for
descriptions of penalized partial likelihood estimation methods for improving mean
squared error of estimates of $\beta$ in a similar fashion to what was discussed in
Section 9.10.


20.1.4 Model Assumptions and Interpretation of Parameters

The Cox PH regression model has the same assumptions as the parametric
PH model except that no assumption is made regarding the shape of the
underlying hazard or survival functions $\lambda(t)$ and $S(t)$. The Cox PH model
assumes, in its most basic form, linearity and additivity of the predictors
with respect to log hazard or log cumulative hazard. It also assumes the PH
assumption of no time by predictor interactions; that is, the predictors have
the same effect on the hazard function at all values of $t$. The relative hazard
function $\exp(X\beta)$ is constant through time and the survival functions for
subjects with different values of $X$ are powers of each other. If, for example,
the hazard of death at time $t$ for treated patients is half that of control
patients at time $t$, this same hazard ratio is in effect at any other time point.
In other words, treated patients have a consistently better hazard of death
over all follow-up time.
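
The "powers of each other" property can be illustrated numerically. The sketch below assumes a Weibull-shaped underlying survival function and a hazard ratio of 0.5 purely for illustration; neither value comes from the text.

```r
times <- seq(0.5, 5, by = 0.5)
S0 <- exp(-(times / 3)^1.5)   # hypothetical underlying survival for controls
hr <- 0.5                     # assumed hazard ratio exp(X beta), treated vs. control
S1 <- S0^hr                   # survival for treated patients under PH

# Cumulative hazards -log S(t) differ by the same factor at every time point
ratio <- log(S1) / log(S0)
all.equal(ratio, rep(hr, length(times)))
```

Because the ratio of cumulative hazards is the same at every time point, the two curves never cross, which is one informal consequence of the PH assumption.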

The  regression  parameters  are  interpreted  the  same  as  in  the  parametric 
PH  model.  The  only  difference  is  the  absence  of  hazard  shape  parameters 
in  the  model,  since  the  hazard  shape  is  not  estimated  in  the  Cox  partial 
likelihood  procedure. 


20.1.5  Example 

Consider  again  the  rat  vaginal  cancer  data  from  Section  18.3.6.  Figure  20.1 
displays  the  nonparametric  survival  estimates  for  the  two  groups  along  with 
estimates  derived  from  the  Cox  model  (by  a  method  discussed  later). 

require(rms)
group <- c(rep('Group 1',19), rep('Group 2',21))
group <- factor(group)
dd    <- datadist(group); options(datadist='dd')
days <-
  c(143,164,188,188,190,192,206,209,213,216,220,227,230,
    234,246,265,304,216,244,142,156,163,198,205,232,232,
    233,233,233,233,239,240,261,280,280,296,296,323,204,344)
death <- rep(1,40)
death[c(18,19,39,40)] <- 0
units(days) <- 'Day'
df <- data.frame(days, death, group)
S <- Surv(days, death)

f <- npsurv(S ~ group, type='fleming')
for(meth in c('exact', 'breslow', 'efron')) {
  g <- cph(S ~ group, method=meth, surv=TRUE, x=TRUE, y=TRUE)
  # print(g) to see results
}
f.exp  <- psm(S ~ group, dist='exponential')
fw     <- psm(S ~ group, dist='weibull')
phform <- pphsm(fw)

co <- gray(c(0, .8))
survplot(f, lty=c(1, 1), lwd=c(1, 3), col=co,
         label.curves=FALSE, conf='none')
survplot(g, lty=c(3, 3), lwd=c(1, 3), col=co,    # Efron approx.
         add=TRUE, label.curves=FALSE, conf.type='none')
legend(c(2, 160), c(.38, .54),
       c('Nonparametric Estimates', 'Cox-Breslow Estimates'),
       lty=c(1, 3), cex=.8, bty='n')
legend(c(2, 160), c(.18, .34), cex=.8,
       c('Group 1', 'Group 2'), lwd=c(1,3), col=co, bty='n')

The predicted survival curves from the fitted Cox model are in good
agreement with the nonparametric estimates, again verifying the PH assumption
for these data. The estimates of the group effect from a Cox model (using the
exact likelihood since there are ties, along with both Efron's and Breslow's
approximations) as well as from a Weibull model and an exponential model
are shown in Table 20.1. The exponential model, with its constant hazard,
cannot accommodate the long early period with no failures. The group
predictor was coded as $X_1 = 0$ and $X_1 = 1$ for Groups 1 and 2, respectively. For
this example, the Breslow likelihood approximation resulted in $\hat{\beta}$ closer to
that from maximizing the exact likelihood. Note how the group effect (47%
reduction in hazard of death by the exact Cox model) is underestimated by
the exponential model (9% reduction in hazard). The hazard ratio from the
Weibull fit agrees with the Cox fit.

Fig. 20.1 Altschuler-Nelson-Fleming-Harrington nonparametric survival estimates
and Cox-Breslow estimates for rat data [508]


Table 20.1 Group effects using three versions of the partial likelihood and three
parametric models

Model           Group Regression   S.E.    Wald      Group 2:1
                Coefficient                P-Value   Hazard Ratio
Cox (Exact)         -0.629         0.361   0.08      0.533
Cox (Efron)         -0.569         0.347   0.10      0.566
Cox (Breslow)       -0.596         0.348   0.09      0.551
Exponential         -0.093         0.334   0.78      0.911
Weibull (AFT)        0.132         0.061   0.03      n/a
Weibull (PH)        -0.721         n/a     n/a       0.486
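
The hazard ratio column of Table 20.1 is simply the exponentiated regression coefficient, and the percent reductions quoted in the text follow from it. A short base R check, using only the coefficients printed in the table:

```r
# Group regression coefficients copied from Table 20.1 (PH-scale models only)
coefs <- c("Cox (Exact)"   = -0.629,
           "Cox (Efron)"   = -0.569,
           "Cox (Breslow)" = -0.596,
           "Exponential"   = -0.093,
           "Weibull (PH)"  = -0.721)

hr <- exp(coefs)                          # Group 2:1 hazard ratios
round(cbind(hazard.ratio  = hr,
            pct.reduction = 100 * (1 - hr)), 3)
```

The Weibull (AFT) coefficient is on the log survival time scale rather than the log hazard scale, so it is not exponentiated here.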

20.1.6  Design  Formulations 

Designs are no different for the Cox PH model than for other models except
for one minor distinction. Since the Cox model does not have an intercept
parameter, the group omitted from $X$ in an ANOVA model will go into the
underlying hazard function. As an example, consider a three-group model for
treatments A, B, and C. We use the two dummy variables

$X_1 = 1$ if treatment is A, 0 otherwise, and
$X_2 = 1$ if treatment is B, 0 otherwise.

The parameter $\beta_1$ is the A : C log hazard ratio, or difference in log hazards at
any time $t$ between treatment A and treatment C. $\beta_2$ is the B : C log hazard
ratio ($\exp(\beta_2)$ is the B : C hazard ratio, etc.). Since there is no intercept
parameter, there is no direct estimate of the hazard function for treatment
C or any other treatment; only relative hazards are modeled.

As with all regression models, a Wald, score, or likelihood ratio test for
differences between any treatments is conducted by testing $H_0 : \beta_1 = \beta_2 = 0$
with 2 d.f.


20.1.7  Extending  the  Model  by  Stratification 

A unique feature of the Cox PH model is its ability to adjust for factors that
are not modeled. Such factors usually take the form of polytomous
stratification factors that either are too difficult to model or do not satisfy the PH
assumption. For example, a subject's occupation or clinical study site may
take on dozens of levels and the sample size may not be large enough to
model this nominal variable with dozens of dummy variables. Also, one may
know that a certain predictor (either a polytomous one or a continuous one
that is grouped) may not satisfy PH and it may be too complex to model the
hazard ratio for that predictor as a function of time.

The idea behind the stratified Cox PH model is to allow the form of the
underlying hazard function to vary across levels of the stratification factors.
A stratified Cox analysis ranks the failure times separately within strata.
Suppose that there are $b$ strata indexed by $j = 1, 2, \ldots, b$. Let $C$ denote the
stratum identification. For example, $C = 1$ or 2 may stand for the female and
male strata, respectively. The stratified PH model is

$$\lambda(t|X, C = j) = \lambda_j(t)\exp(X\beta), \quad \text{or}$$

$$S(t|X, C = j) = S_j(t)^{\exp(X\beta)}. \tag{20.8}$$

Here $\lambda_j(t)$ and $S_j(t)$ are, respectively, the underlying hazard and survival
functions for the $j$th stratum. The model does not assume any connection
between the shapes of these functions for different strata.

In this stratified analysis, the data are stratified by $C$ but, by default, a
common vector of regression coefficients is fitted across strata. These common
regression coefficients can be thought of as “pooled” estimates. For example,
a Cox model with age as a (modeled) predictor and sex as a stratification
variable essentially estimates the common slope of age by pooling information
about the age effect over the two sexes. The effect of age is adjusted for sex
differences, but no assumption is made about how sex affects survival. There
is no PH assumption for sex. Levels of the stratification factor $C$ can represent
multiple stratification factors that are cross-classified. Since these factors are
not modeled, no assumption is made regarding interactions among them.

At first glance it appears that stratification causes a loss of efficiency.
However, in most cases the loss is small as long as the number of strata is not
too large with regard to the total number of events. A stratum that contains
no events contributes no information to the analysis, so such a situation
should be avoided if possible.

The stratified or “pooled” Cox model is fitted by formulating a separate
log likelihood function for each stratum, but with each log likelihood having a
common $\beta$ vector. If different strata are made up of independent subjects, the
strata are independent and the likelihood functions are multiplied together
to form a joint likelihood over strata. Log likelihood functions are thus added
over strata. This total log likelihood function is maximized once to derive a
pooled or stratified estimate of $\beta$ and to make an inference about $\beta$. No
inference can be made about the stratification factors. They are merely “adjusted
for.”
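
Adding log likelihoods over strata can be sketched in base R as follows. The helper names `cox.pll` and `strat.pll` and the toy data are hypothetical; the sketch handles one predictor and assumes no tied failure times within a stratum.

```r
# Log partial likelihood for one stratum (single predictor, no ties)
cox.pll <- function(beta, time, status, x) {
  eta <- x * beta
  sum(sapply(which(status == 1), function(i)
    eta[i] - log(sum(exp(eta[time >= time[i]])))))
}

# Stratified version: risk sets are formed within strata, beta is common
strat.pll <- function(beta, time, status, x, strata) {
  sum(sapply(split(seq_along(time), strata), function(idx)
    cox.pll(beta, time[idx], status[idx], x[idx])))
}

# Hypothetical data: 5 subjects in stratum "F", 3 in stratum "M"
time   <- c(2, 4, 5, 7, 9,  1, 3, 6)
status <- c(1, 1, 0, 1, 1,  1, 1, 1)
x      <- c(1, 0, 1, 0, 1,  0, 1, 0)
strata <- rep(c("F", "M"), c(5, 3))

strat.pll(0, time, status, x, strata)   # -(log(5*4*2*1) + log(3*2*1))
cox.pll(0, time, status, x)             # unstratified: risk sets pool both strata
```

Only the composition of the risk sets changes under stratification; the coefficient being estimated is the same in every stratum.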

Stratification is useful for checking the PH and linearity assumptions for
one or more predictors. Predicted Cox survival curves (Section 20.2) can
be derived by modeling the predictors in the usual way, and then stratified
survival curves can be estimated by using those predictors as stratification
factors. Other factors for which PH is assumed can be modeled in both
instances. By comparing the modeled versus stratified survival estimates, a
graphical check of the assumptions can be made. Figure 20.1 demonstrates
this method although there are no other factors being adjusted for and
stratified Cox estimates are KM estimates. The stratified survival estimates are
derived by stratifying the dataset to obtain a separate underlying survival
curve for each stratum, while pooling information across strata to estimate
coefficients of factors that are modeled.

Besides allowing a factor to be adjusted for without modeling its effect,
a stratified Cox PH model can also allow a modeled factor to interact with
strata [143, 180, 603]. For the age-sex example, consider the following model with
$X_1$ denoting age and $C = 1, 2$ denoting females and males, respectively.

$$\lambda(t|X_1, C = 1) = \lambda_1(t)\exp(\beta_1 X_1)$$

$$\lambda(t|X_1, C = 2) = \lambda_2(t)\exp(\beta_1 X_1 + \beta_2 X_1). \tag{20.9}$$

This model can be simplified to

$$\lambda(t|X_1, C = j) = \lambda_j(t)\exp(\beta_1 X_1 + \beta_2 X_2) \tag{20.10}$$

if $X_2$ is a product interaction term equal to 0 for females and $X_1$ for males.
The $\beta_2$ parameter quantifies the interaction between age and sex: it is the
difference in the age slope between males and females. Thus the interaction
between age and sex can be quantified and tested, even though the effect of
sex is not modeled!

The stratified Cox model is commonly used to adjust for hospital
differences in a multicenter randomized trial. With this method, one can allow
for differences in outcome between $q$ hospitals without estimating $q - 1$
parameters. Treatment $\times$ hospital interactions can be tested efficiently without
computational problems by estimating only the treatment main effect, after
stratifying on hospital. The score statistic (with $q - 1$ d.f.) for testing $q - 1$
treatment $\times$ hospital interaction terms is then computed (“residual $\chi^2$” in a
stepwise procedure with treatment $\times$ hospital terms as candidate predictors).

The  stratified  Cox  model  turns  out  to  be  a  generalization  of  the  condi¬ 
tional  logistic  model  for  analyzing  matched  set  (e.g.,  case-control)  data/1 
Each  stratum  represents  a  set,  and  the  number  of  “failures”  in  the  set  is  the 
number  of  “cases”  in  that  set.  For  r  :  1  matching  (r  may  vary  across  sets),  the 
Breslow'1  likelihood  may  be  used  to  fit  the  conditional  logistic  model  exactly. 
For  r  :  m  matching,  an  exact  Cox  likelihood  must  be  computed. 


20.2  Estimation  of  Survival  Probability  and  Secondary 
Parameters 

As discussed above, once a partial log likelihood function is derived, it is
used as if it were an ordinary log likelihood function to estimate $\beta$, estimate
standard errors of $\hat{\beta}$, obtain confidence limits, and make statistical tests. Point
and interval estimates of hazard ratios are obtained in the same fashion as
with parametric PH models discussed earlier.

The Cox model and parametric survival models differ markedly in how one
estimates $S(t \mid X)$. Since the Cox model does not depend on a choice of the
underlying survival function $S(t)$, fitting a Cox model does not result directly
in an estimate of $S(t \mid X)$. However, several authors have derived secondary
estimates of $S(t \mid X)$. One method is the discrete hazard model of Kalbfleisch
and Prentice [331, pp. 36-37, 84-87]. Their estimator has two advantages: it
is an extension of the Kaplan-Meier estimator and is identical to $S_{\text{KM}}$ if the
estimated value of $\beta$ happened to be zero or there are no covariables being
modeled; and it is not affected by the choice of what constitutes a “standard”
subject having the underlying survival function $S(t)$. In other words, it would
not matter whether the standard subject is one having age equal to the mean
age in the sample or the median age in the sample; the estimate of $S(t \mid X)$
as a function of $X$ = age would be the same (this is also true of another
estimator which follows).

Let $t_1, t_2, \ldots, t_k$ denote the unique failure times in the sample. The discrete
hazard model assumes that the probability of failure is greater than zero only
at observed failure times. The probability of failure at time $t_j$ given that the
subject has not failed before that time is also the hazard of failure at time
$t_j$ since the model is discrete. The hazard at $t_j$ for the standard subject is
written $\lambda_j$. Letting $\alpha_j = 1 - \lambda_j$, the underlying survival function can be
written

$$S(t_i) = \prod_{j=0}^{i-1} \alpha_j, \quad i = 1, 2, \ldots, k \quad (\alpha_0 = 1). \tag{20.11}$$

A separate equation can be solved using the Newton-Raphson method to
estimate each $\alpha_j$. If there is only one failure at time $t_i$, there is a closed-form
solution for the maximum likelihood estimate of $\alpha_i$, $\hat{\alpha}_i$, letting $j$ denote the
subject who failed at $t_i$. $\hat{\beta}$ denotes the partial MLE of $\beta$.

$$\hat{\alpha}_i = \left[ 1 - \exp(X_j \hat{\beta}) \Big/ \sum_{m: Y_m \ge Y_j} \exp(X_m \hat{\beta}) \right]^{\exp(-X_j \hat{\beta})} \tag{20.12}$$

If $\hat{\beta} = 0$, this formula reduces to a conditional probability component of the
product-limit estimator, $1 - (1/\text{number at risk})$.

The estimator of the underlying survival function is

$$\hat{S}(t) = \prod_{j: t_j < t} \hat{\alpha}_j, \tag{20.13}$$

and the estimate of the probability of survival past time $t$ for a subject with
predictor values $X$ is

$$\hat{S}(t \mid X) = \hat{S}(t)^{\exp(X \hat{\beta})}. \tag{20.14}$$
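
As a concrete illustration of Equations 20.12 to 20.14, the following base R sketch computes the closed-form $\hat{\alpha}_i$ for data with no tied failure times and then the survival estimates. The function names (`kp_alpha`, `kp_surv`), the argument `lp` (the linear predictor $X\hat{\beta}$, assumed to have been obtained elsewhere), and the toy data are illustrative assumptions, not part of any package.

```r
# Kalbfleisch-Prentice estimates when there are no tied failure times.
# time: follow-up time; event: 1 = failure, 0 = censored;
# lp: linear predictor X * beta-hat for each subject (assumed given).
kp_alpha <- function(time, event, lp) {
  fail <- which(event == 1)
  fail <- fail[order(time[fail])]              # failures in time order
  alpha <- sapply(fail, function(j) {
    risk <- time >= time[j]                    # risk set at this failure time
    (1 - exp(lp[j]) / sum(exp(lp[risk])))^exp(-lp[j])   # Eq. 20.12
  })
  data.frame(time = time[fail], alpha = alpha)
}

# Product of alpha-hats over failure times before t (Eq. 20.13),
# raised to exp(X * beta-hat) for a new subject (Eq. 20.14).
kp_surv <- function(fit, t, lp_new = 0) {
  prod(fit$alpha[fit$time < t])^exp(lp_new)
}

time  <- c(2, 3, 5, 7, 8, 11)
event <- c(1, 1, 0, 1, 1, 0)
fit0  <- kp_alpha(time, event, lp = rep(0, 6))
fit0$alpha            # with beta-hat = 0: 1 - 1/(number at risk)
kp_surv(fit0, 7.5)    # product-limit estimate just after the third failure
```

With all linear predictors set to zero the $\hat{\alpha}_i$ are the Kaplan-Meier conditional survival probabilities, which is the reduction noted after Equation 20.12. Tied failure times would require the iterative solution and are not handled here.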

When the model is stratified, estimation of the $\alpha_j$ and $S$ is carried out
separately within each stratum once $\hat{\beta}$ is obtained by pooling over strata. The
stratified survival function estimates can be thought of as stratified
Kaplan-Meier estimates adjusted for $X$, with the adjustment made by assuming PH
and linearity. As mentioned previously, these stratified adjusted survival
estimates are useful for checking model assumptions and for providing a simple
way to incorporate factors that violate PH.

The stratified estimates are also useful in themselves as descriptive
statistics without making assumptions about a major factor. For example, in a
study from Califf et al. [88] to compare medical therapy with coronary artery
bypass grafting (CABG), the model was stratified by treatment but adjusted
for a variety of baseline characteristics by modeling. These adjusted survival
estimates do not assume a form for the effect of surgery. Figure 20.2 displays
unadjusted (Kaplan-Meier) and adjusted survival curves, with baseline
predictors adjusted to their mean levels in the combined sample. Notice that
valid adjusted survival estimates are obtained even though the curves cross
(i.e., PH is violated for the treatment variable). These curves are essentially
product limit estimates with respect to treatment and Cox PH estimates with
respect to the baseline descriptor variables.

The Kalbfleisch-Prentice discrete underlying hazard model estimates of
the $\alpha_j$ are one minus estimates of the hazard function at the discrete failure
times. However, these estimated hazard functions are usually too “noisy” to
be useful unless the sample size is very large or the failure times have been
grouped (say by rounding).

Fig. 20.2 Unadjusted (Kaplan-Meier) and adjusted (Cox-Kalbfleisch-Prentice)
estimates of survival. Left, Kaplan-Meier estimates for patients treated medically and
surgically at Duke University Medical Center from November 1969 through December
1984. These survival curves are not adjusted for baseline prognostic factors. Right,
survival curves for patients treated medically or surgically after adjusting for all
known important baseline prognostic characteristics [88].


Just  as  Kalbfleisch  and  Prentice  have  generalized  the  Kaplan-Meier  es¬ 
timator  to  allow  for  covariables,  Breslow'  has  generalized  the  Altschuler- 
Nelson-Aalen-Fleming-Harrington  estimator  to  allow  for  covariables.  Using 
the  notation  in  Section  20.1.3,  Breslow’s  estimate  is  derived  through  an  esti¬ 
mate  of  the  cumulative  hazard  function: 

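$$\hat{\Lambda}(t) = \sum_{i: t_i < t} \frac{d_i}{\sum_{Y_m \ge t_i} \exp(X_m \hat{\beta})}. \tag{20.15}$$

For any $X$, the estimates of $\Lambda$ and $S$ are

$$\hat{\Lambda}(t \mid X) = \hat{\Lambda}(t) \exp(X \hat{\beta})$$

$$\hat{S}(t \mid X) = \exp[-\hat{\Lambda}(t) \exp(X \hat{\beta})]. \tag{20.16}$$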
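
The Breslow estimator needs no iteration even with ties, which a few lines of base R make plain. This is a minimal sketch of Equations 20.15 and 20.16; the function names, the argument `lp` (the linear predictor $X\hat{\beta}$, assumed already estimated), and the toy data are illustrative assumptions.

```r
# Breslow estimate of the underlying cumulative hazard at time t (Eq. 20.15).
# time: follow-up time; event: 1 = failure; lp: X * beta-hat per subject.
breslow_cumhaz <- function(time, event, lp, t) {
  ft <- sort(unique(time[event == 1 & time < t]))   # failure times before t
  if (length(ft) == 0) return(0)
  sum(sapply(ft, function(s) {
    d <- sum(time == s & event == 1)                # d_i: failures at t_i
    d / sum(exp(lp[time >= s]))                     # sum over the risk set
  }))
}

# Survival estimate for a subject with linear predictor lp_new (Eq. 20.16).
breslow_surv <- function(time, event, lp, t, lp_new = 0) {
  exp(-breslow_cumhaz(time, event, lp, t) * exp(lp_new))
}

time  <- c(2, 3, 3, 5, 7, 8)      # two tied failures at time 3
event <- c(1, 1, 1, 0, 1, 0)
H <- breslow_cumhaz(time, event, lp = rep(0, 6), t = 7.5)
H                                 # 1/6 + 2/5 + 1/2 when beta-hat = 0
breslow_surv(time, event, lp = rep(0, 6), t = 7.5)
```

With $\hat{\beta} = 0$ the cumulative hazard is the sum of (failures / number at risk) over the failure times before $t$, the covariate-free estimator that the text says Breslow generalized.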

More asymptotic theory has been derived for the Breslow estimator than
for the Kalbfleisch-Prentice estimator. Another advantage of the Breslow
estimator is that it does not require iterative computations for $d_i > 1$.
Lawless [382, p. 362] states that the two survival function estimators differ little
except in the right-hand tail when all $d_i$ are unity. Like the
Kalbfleisch-Prentice estimator, the Breslow estimator is invariant under different choices
of “standard subjects” for the underlying survival $S(t)$.

Somewhat complex formulas are available for computing confidence limits
of $S(t \mid X)$ [615].


20.3 Sample Size Considerations

One way of estimating the minimum sample size for a Cox model analysis
aimed at estimating survival probabilities is to consider the simplest case
where there are no covariates. Thus the problem reduces to using the
Kaplan-Meier estimate to estimate $S(t)$. Let’s further simplify things to assume there
is no censoring. Then the Kaplan-Meier estimate is just one minus the
empirical cumulative distribution function. By the Dvoretzky-Kiefer-Wolfowitz
inequality, the maximum absolute error in an empirical distribution function
estimate of the true continuous distribution function is less than or equal to
$\epsilon$ with probability of at least $1 - 2e^{-2n\epsilon^2}$. For the probability to be at least
0.95 when $\epsilon = 0.1$, $n = 184$. Thus in the case of no censoring, one needs 184 subjects to
estimate the survival curve to within a margin of error of 0.1 everywhere.
To estimate the subject-specific survival curves ($S(t \mid X)$) will require greater
sample sizes, as will having censored data. It is a fair approximation to think
of 184 as the needed number of subjects suffering the event or being censored
“late.”
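
The figure of 184 follows from solving $1 - 2e^{-2n\epsilon^2} \ge 0.95$ for $n$, that is, $n \ge \log(2/0.05) / (2\epsilon^2)$. A short base R check, with variable names chosen only for illustration:

```r
# Dvoretzky-Kiefer-Wolfowitz bound: smallest n for which the maximum absolute
# error of the empirical distribution function is <= eps with probability
# of at least conf (no censoring, no covariates).
eps   <- 0.1
conf  <- 0.95
n_req <- log(2 / (1 - conf)) / (2 * eps^2)
n_req      # about 184.4, which the text reports as 184
```

Halving the margin of error to 0.05 quadruples the requirement, since $n$ scales with $1/\epsilon^2$.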

Turning to estimation of a hazard ratio for a single binary predictor $X$
that has equal numbers of $X = 0$ and $X = 1$, if the total sample size is $n$
and the numbers of events in the two categories are respectively $e_0$ and $e_1$,
the variance of the log hazard ratio is approximately $v = \frac{1}{e_0} + \frac{1}{e_1}$. Letting $z$
denote the $1 - \alpha/2$ standard normal critical value, the multiplicative margin
of error (MMOE) with confidence $1 - \alpha$ is given by $\exp(z\sqrt{v})$. To achieve
a MMOE of 1.2 in estimating the hazard ratio with equal numbers of events in the two
groups and $\alpha = 0.05$ requires a total of 462 events.
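
The 462 events can be reproduced by inverting $\exp(z\sqrt{v})$ with $v = 2/e$ for $e$ events per group. The helper below is an illustrative sketch in base R; the name `mmoe_events` is made up for this example.

```r
# Total number of events needed to reach a given multiplicative margin of
# error for a hazard ratio, assuming equal numbers of events in both groups.
mmoe_events <- function(mmoe, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)       # standard normal critical value
  v <- (log(mmoe) / z)^2          # variance of log hazard ratio that is required
  e_per_group <- 2 / v            # v = 1/e0 + 1/e1 = 2/e when e0 = e1 = e
  2 * e_per_group                 # total events
}
total <- mmoe_events(1.2)
total    # about 462
```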


20.4  Test  Statistics 

Wald, score, and likelihood ratio statistics are useful and valid for drawing
inferences about $\beta$ in the Cox model. The score test deserves special mention
here. If there is a single binary predictor in the model that describes two
groups, the score test for assessing the importance of the binary predictor
is virtually identical to the Mantel-Haenszel log-rank test for comparing the
two groups. If the analysis is stratified for other (nonmodeled) factors, the
score test from a stratified Cox model is equivalent to the corresponding
stratified log-rank test. Of course, the likelihood ratio or Wald tests could
also be used in this situation, and in fact the likelihood ratio test may be
better than the score test (i.e., type I errors by treating the likelihood ratio
test statistic as having a $\chi^2$ distribution may be more accurate than using
the log-rank statistic).
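
To see why the two tests are “virtually identical,” the sketch below computes both statistics by hand in base R for a 0/1 predictor. It assumes the Breslow form of the partial likelihood for the score test; under that assumption the numerators agree and the variances differ only by the hypergeometric tie factor $(n - d)/(n - 1)$, which equals 1 when there are no tied failure times. The function name and data are illustrative.

```r
# Score statistic at beta = 0 (Breslow likelihood) and log-rank statistic
# for a binary predictor x coded 0/1.
score_vs_logrank <- function(time, event, x) {
  ft <- sort(unique(time[event == 1]))        # distinct failure times
  U <- 0; V_score <- 0; V_logrank <- 0
  for (s in ft) {
    at_risk <- time >= s
    failed  <- time == s & event == 1
    n <- sum(at_risk)                         # number at risk at s
    p <- sum(x[at_risk]) / n                  # fraction at risk with x = 1
    d <- sum(failed)                          # number of failures at s
    U <- U + sum(x[failed]) - d * p           # observed minus expected
    V_score <- V_score + d * p * (1 - p)      # partial-likelihood information
    tie_adj <- if (n > 1) (n - d) / (n - 1) else 0
    V_logrank <- V_logrank + d * p * (1 - p) * tie_adj   # hypergeometric variance
  }
  c(score = U^2 / V_score, logrank = U^2 / V_logrank)
}

time  <- 1:10                                 # no tied failure times
event <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)
x     <- c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1)
res <- score_vs_logrank(time, event, x)
res                                           # the two statistics coincide
```

With heavily tied data the two values separate slightly, and other tie-handling choices (such as Efron's likelihood) would change the score statistic further.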

The Cox model can be thought of as a generalization of the log-rank
procedure since it allows one to test continuous predictors, perform simultaneous
tests of various predictors, and adjust for other continuous factors without
grouping them. Although a stratified log-rank test does not make
assumptions regarding the effect of the adjustment (stratifying) factors, it makes the
same assumption (i.e., PH) as the Cox model regarding the treatment effect
for the statistical test of no difference in survival between groups.


20.5  Residuals 


Therneau et al. [605] discussed four types of residuals from the Cox model:
martingale, score, Schoenfeld, and deviance. The first three have proven
to be very useful, as indicated in Table 20.2.


Table 20.2 Types of residuals for the Cox model

Residual      Purposes

Martingale    Assessing adequacy of a hypothesized predictor
              transformation. Graphing an estimate of a
              predictor transformation (Section 20.6.1).

Score         Detecting overly influential observations
              (Section 20.9). Robust estimate of
              covariance matrix of $\hat{\beta}$ (Section 9.5) [410].

Schoenfeld    Testing PH assumption (Section 20.6.2).
              Graphing estimate of hazard ratio function
              (Section 20.6.2).


20.6  Assessment  of  Model  Fit 

As stated before, the Cox model makes the same assumptions as the
parametric PH model except that it does not assume a given shape for $\lambda(t)$ or
$S(t)$. Because the Cox PH model is so widely used, methods of assessing its fit
are dealt with in more detail than was done with the parametric PH models.


20.6.1  Regression  Assumptions 

Regression assumptions (linearity, additivity) for the PH model are displayed
in Figures 18.3 and 18.5. As mentioned earlier, the regression assumptions can
be verified by stratifying by $X$ and examining $\log \hat{\Lambda}(t \mid X)$ or $\log[\Lambda_{\text{KM}}(t \mid X)]$
estimates as a function of $X$ at fixed time $t$. However, as was pointed out
in logistic regression, the stratification method is prone to problems of high
variability of estimates. The sample size must be moderately large before
estimates are precise enough to observe trends through the “noise.” If one
wished to divide the sample by quintiles of age and 15 events were thought to
be needed in each stratum to derive a reliable estimate of $\log[\Lambda_{\text{KM}}(2 \text{ years})]$,
there would need to be 75 events in the entire sample. If the Kaplan-Meier
estimates needed to be adjusted for another factor that was binary, twice
as many events would be needed to allow the sample to be stratified by that
factor.

Figure 20.3 displays Kaplan-Meier three-year log cumulative hazard
estimates stratified by sex and decile of age. The simulated sample consists of
2000 hypothetical subjects (389 of whom had events), with 1174 males (146
deaths) and 826 females (243 deaths). The sample was drawn from a
population with a known survival distribution that is exponential with hazard
function

$$\lambda(t \mid X_1, X_2) = .02 \exp[.8 X_1 + .04 (X_2 - 50)], \tag{20.17}$$

where $X_1$ represents the sex group (0 = male, 1 = female) and $X_2$ age in
years, and censoring is uniform. Thus for this population PH, linearity, and
additivity hold. Notice the amount of variability and wide confidence limits
in the stratified nonparametric survival estimates.


require(rms)   # provides cph, datadist, Predict; loads Hmisc (label, units, cut2)
n <- 2000
set.seed(3)
age <- 50 + 12 * rnorm(n)
label(age) <- 'Age'
sex <- factor(1 + (runif(n) <= .4), 1:2, c('Male', 'Female'))
cens <- 15 * runif(n)
h <- .02 * exp(.04 * (age - 50) + .8 * (sex == 'Female'))
ft <- -log(runif(n)) / h
e <- ifelse(ft <= cens, 1, 0)
print(table(e))

e
   0    1
1611  389

ft <- pmin(ft, cens)
units(ft) <- 'Year'
Srv <- Surv(ft, e)
age.dec <- cut2(age, g=10, levels.mean=TRUE)
label(age.dec) <- 'Age'
dd <- datadist(age, sex, age.dec);  options(datadist='dd')
f.np <- cph(Srv ~ strat(age.dec) + strat(sex), surv=TRUE)
# surv=TRUE speeds up computations, and confidence limits when
# there are no covariables are still accurate.
p <- Predict(f.np, age.dec, sex, time=3, loglog=TRUE)
# Treat age.dec as a numeric variable (means within deciles)
p$age.dec <- as.numeric(as.character(p$age.dec))
ggplot(p, ylim=c(-5, -.5))


Fig. 20.3 Kaplan-Meier log $\Lambda$ estimates by sex and deciles of age, with 0.95
confidence limits. Solid line is for males, dashed line for females.


As with the logistic model and other regression models, the restricted cubic
spline function is an excellent tool for modeling the regression relationship
with very few assumptions. A four-knot spline Cox PH model in two variables
($X_1$, $X_2$) that assumes linearity in $X_1$ and no $X_1 \times X_2$ interaction is given by

$$\lambda(t \mid X) = \lambda(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2' + \beta_4 X_2'')$$

$$= \lambda(t) \exp(\beta_1 X_1 + f(X_2)), \tag{20.18}$$

where $X_2'$ and $X_2''$ are spline component variables as described earlier and
$f(X_2)$ is the spline function or spline transformation of $X_2$ given by

$$f(X_2) = \beta_2 X_2 + \beta_3 X_2' + \beta_4 X_2''. \tag{20.19}$$

In linear form the Cox model without assuming linearity in $X_2$ is

$$\log \lambda(t \mid X) = \log \lambda(t) + \beta_1 X_1 + f(X_2). \tag{20.20}$$

By computing partial MLEs of $\beta_2$, $\beta_3$, and $\beta_4$, one obtains the estimated
transformation of $X_2$ that yields linearity in log hazard or log cumulative
hazard.

A similar model that does not assume PH in $X_1$ is the Cox model stratified
on $X_1$. Letting the stratification factor be $C = X_1$, this model is

$$\log \lambda(t \mid X_2, C = j) = \log \lambda_j(t) + \beta_1 X_2 + \beta_2 X_2' + \beta_3 X_2''$$

$$= \log \lambda_j(t) + f(X_2). \tag{20.21}$$

This model does assume no $X_1 \times X_2$ interaction.

Figure 20.4 displays the estimated spline function relating age and sex to
$\log[\Lambda(3)]$ in the simulated dataset, using the additive model stratified on sex.

f.noia <- cph(Srv ~ rcs(age,4) + strat(sex), x=TRUE, y=TRUE)
# Get accurate C.L. for any age by specifying x=TRUE y=TRUE
# Note: for evaluating shape of regression, we would not
# ordinarily bother to get 3-year survival probabilities -
# would just use X * beta
# We do so here to use same scale as nonparametric estimates
w <- latex(f.noia, inline=TRUE, digits=3)
latex(anova(f.noia), table.env=FALSE, file='')

               $\chi^2$   d.f.   P
age            72.33      3      <0.0001
  Nonlinear     0.69      2      0.7067
TOTAL          72.33      3      <0.0001

p <- Predict(f.noia, age, sex, time=3, loglog=TRUE)
ggplot(p, ylim=c(-5, -.5))


Fig. 20.4 Cox PH model stratified on sex, using spline function for age, no
interaction. 0.95 confidence limits also shown. Solid line is for males, dashed line is for
females.

A formal test of the linearity assumption of the Cox PH model in the
above example is obtained by testing $H_0: \beta_2 = \beta_3 = 0$. The $\chi^2$ statistic with
2 d.f. is 0.69, $P = 0.7$. The fitted equation, after simplifying the restricted
cubic spline to simpler (unrestricted) form, is
$X\hat{\beta} = -1.46 + 0.0255\,\text{age} + 2.59 \times 10^{-5}(\text{age} - 30.3)_+^3 - 0.000101(\text{age} - 45.1)_+^3 + 9.73 \times 10^{-5}(\text{age} - 54.6)_+^3 - 2.22 \times 10^{-5}(\text{age} - 69.6)_+^3$.
Notice that the spline estimates are closer to the
true linear relationships than were the Kaplan-Meier estimates, and the
confidence limits are much tighter. The spline estimates impose a smoothness
on the relationship and also use more information from the data by treating
age as a continuous ordered variable. Also, unlike the stratified Kaplan-Meier
estimates, the modeled estimates can make the assumption of no age $\times$ sex
interaction. When this assumption is true, modeling effectively boosts the
sample size in estimating a common function for age across both sex groups.
Of course, this assumption can be tested and interactions can be modeled if
necessary.
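
The unrestricted form of the fitted spline can be evaluated directly, which is a quick way to see how nearly linear the estimate is. The base R sketch below hard-codes the rounded coefficients and knots printed above; the function names are illustrative, and because the coefficients are rounded, the linearity beyond the last knot holds only approximately.

```r
# Fitted log relative hazard from the additive spline model, using the
# rounded coefficients and knot locations reported in the text.
cubic_plus <- function(u) pmax(u, 0)^3          # (u)_+^3
xbeta_age <- function(age) {
  -1.46 + 0.0255 * age +
    2.59e-5 * cubic_plus(age - 30.3) -
    1.01e-4 * cubic_plus(age - 45.1) +
    9.73e-5 * cubic_plus(age - 54.6) -
    2.22e-5 * cubic_plus(age - 69.6)
}

ages <- seq(20, 80, by = 10)
round(xbeta_age(ages), 3)            # fitted values over the age range
diff(xbeta_age(c(20, 25, 30)))       # below the first knot the slope is 0.0255 per year
```

For comparison, the true age slope in the simulation (Equation 20.17) is 0.04 per year.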

A Cox model that still does not assume PH for $X_1 = C$ but which allows
for an $X_1 \times X_2$ interaction is

$$\log \lambda(t \mid X_2, C = j) = \log \lambda_j(t) + \beta_1 X_2 + \beta_2 X_2' + \beta_3 X_2'' + \beta_4 X_1 X_2 + \beta_5 X_1 X_2' + \beta_6 X_1 X_2''. \tag{20.22}$$

This model allows the relationship between $X_2$ and log hazard to be a smooth
nonlinear function and the shape of the $X_2$ effect to be completely different
for each level of $X_1$ if $X_1$ is dichotomous. Figure 20.5 displays a fit of this
model at $t = 3$ years for the simulated dataset.


f.ia <- cph(Srv ~ rcs(age,4) * strat(sex), x=TRUE, y=TRUE,
            surv=TRUE)
w <- latex(f.ia, inline=TRUE, digits=3)
latex(anova(f.ia), table.env=FALSE, file='')

                                            $\chi^2$   d.f.   P
age (Factor+Higher Order Factors)           72.82      6      <0.0001
  All Interactions                           1.05      3      0.7886
  Nonlinear (Factor+Higher Order Factors)    1.80      4      0.7728
age $\times$ sex (Factor+Higher Order Factors)  1.05   3      0.7886
  Nonlinear                                  1.05      2      0.5911
  Nonlinear Interaction : f(A,B) vs. AB      1.05      2      0.5911
TOTAL NONLINEAR                              1.80      4      0.7728
TOTAL NONLINEAR + INTERACTION                1.80      5      0.8763
TOTAL                                       72.82      6      <0.0001

p <- Predict(f.ia, age, sex, time=3, loglog=TRUE)
ggplot(p, ylim=c(-5, -.5))

Fig. 20.5 Cox PH model stratified on sex, with interaction between age spline and
sex. 0.95 confidence limits are also shown. Solid line is for males, dashed line for
females.


The fitted equation is
$X\hat{\beta} = -1.8 + 0.0493\,\text{age} - 2.15 \times 10^{-6}(\text{age} - 30.3)_+^3 - 2.82 \times 10^{-5}(\text{age} - 45.1)_+^3 + 5.18 \times 10^{-5}(\text{age} - 54.6)_+^3 - 2.15 \times 10^{-5}(\text{age} - 69.6)_+^3 + [\text{Female}][-0.0366\,\text{age} + 4.29 \times 10^{-5}(\text{age} - 30.3)_+^3 - 0.00011(\text{age} - 45.1)_+^3 + 6.74 \times 10^{-5}(\text{age} - 54.6)_+^3 - 2.32 \times 10^{-7}(\text{age} - 69.6)_+^3]$.
The test for interaction
yielded $\chi^2 = 1.05$ with 3 d.f., $P = 0.8$. The simultaneous test for linearity
and additivity yielded $\chi^2 = 1.8$ with 5 d.f., $P = 0.9$. Note that allowing the
model to be very flexible (not assuming linearity in age, additivity between
age and sex, and PH for sex) still resulted in estimated regression functions
that are very close to the true functions. However, confidence limits in this
unrestricted model are much wider.

Figure 20.6 displays the estimated relationship between left ventricular
ejection fraction (LVEF) and log hazard ratio for cardiovascular death in a
sample of patients with significant coronary artery disease. The relationship
is estimated using three knots placed at quantiles 0.05, 0.5, and 0.95 of LVEF.
Here there is significant nonlinearity (Wald $\chi^2 = 9.6$ with 1 d.f.). The graph
leads to a transformation of LVEF that better satisfies the linearity
assumption: min(LVEF, 0.5). This transformation has the best log likelihood “for the
money” as judged by the Akaike information criterion (AIC = $-2 \log$ L.R.
$- 2 \times$ no. parameters = 127). The AICs for 3, 4, 5, and 6-knot spline fits were,
respectively, 126, 124, 122, and 120.

Had the suggested transformation been more complicated than a
truncation, a tentative transformation could have been checked for adequacy by
expanding the new transformed variable into a new spline function and
testing it for linearity.

Fig. 20.6 Restricted cubic spline estimate of relationship between LVEF and relative
log hazard from a sample of 979 patients and 198 cardiovascular deaths. Data from
the Duke Cardiovascular Disease Databank.


Other methods based on smoothed residual plots are also valuable tools
for selecting predictor transformations. Therneau et al. [605] describe residuals
based on martingale theory that can estimate transformations of any number
of predictors omitted from a Cox model fit, after adjusting for other
variables included in the fit. Figure 20.7 used various smoothing methods on the
points (LVEF, residual). First, the R loess function was used to obtain a
smoothed scatterplot fit and approximate 0.95 confidence bars. Second, an
ordinary least squares model, representing LVEF as a restricted cubic spline
with five default knots, was fitted. Ideally, both fits should have used weighted
regression as the residuals do not have equal variance. Predicted values from
this fit along with 0.95 confidence limits are shown. The loess and
spline-linear regression agree extremely well. Third, Cleveland’s lowess scatterplot
smoother [111] was used on the martingale residuals against LVEF. The
suggested transformation from all three is very similar to that of Figure 20.6. For
smaller sample sizes, the raw residuals should also be displayed. There is one
vector of martingale residuals that is plotted against all of the predictors.
When correlations among predictors are mild, plots of estimated predictor
transformations without adjustment for other predictors (i.e., marginal
transformations) may be useful. Martingale residuals may be obtained quickly by
fixing $\hat{\beta} = 0$ for all predictors. Then smoothed plots of predictor against
residual may be made for all predictors. Table 20.3 summarizes some of the
ways martingale residuals may be used. See Section 10.5 for more information
on checking the regression assumptions. The methods for examining
interaction surfaces described there apply without modification to the Cox model
(except that the nonparametric regression surface does not apply because of
censoring).

Fig. 20.7 Three smoothed estimates relating martingale residuals [605] to LVEF.

Table 20.3 Uses of martingale residuals for estimating predictor transformations

Purpose                               Method

Estimate transformation for           Force $\hat{\beta}_1 = 0$ and compute
a single variable                     residuals from the null regression

Check linearity assumption for        Compute $\hat{\beta}_1$ and compute
a single variable                     residuals from the linear regression

Estimate marginal                     Force $\hat{\beta}_1, \ldots, \hat{\beta}_p = 0$ and compute
transformations for p variables       residuals from the global null model

Estimate transformation for           Estimate $p - 1$ $\beta$s, forcing $\hat{\beta}_i = 0$.
variable i adjusted for other         Compute residuals from mixed
p - 1 variables                       global/null model
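
The first row of Table 20.3 can be sketched without any model fitting. With all coefficients forced to zero, a subject's martingale residual is the event indicator minus the estimated cumulative hazard at that subject's follow-up time. The base R sketch below assumes the Nelson-Aalen (Breslow-type) cumulative hazard for this purpose and uses simulated data; the function name, variable names, and simulation settings are all illustrative.

```r
# Martingale residuals from the null model (all beta forced to 0):
# event indicator minus the Nelson-Aalen cumulative hazard at follow-up time.
null_martingale <- function(time, event) {
  ft <- sort(unique(time[event == 1]))                        # distinct failure times
  d  <- sapply(ft, function(s) sum(time == s & event == 1))   # failures at each time
  r  <- sapply(ft, function(s) sum(time >= s))                # number at risk
  H  <- c(0, cumsum(d / r))                # cumulative hazard; 0 before first failure
  event - H[findInterval(time, ft) + 1]
}

set.seed(1)
n     <- 300
x     <- runif(n, 0, 2)
ft    <- rexp(n, rate = 0.2 * exp((x - 1)^2))   # true log hazard is quadratic in x
cens  <- runif(n, 0, 10)
event <- as.numeric(ft <= cens)
time  <- pmin(ft, cens)

res <- null_martingale(time, event)
sm  <- lowess(x, res)     # smoothed residuals suggest the shape of the transformation
# plot(x, res, col = "gray"); lines(sm)   # uncomment to view the smooth
```

The residuals sum to zero by construction, and a smooth of residual against `x` should show roughly the U shape built into the simulation, up to scale and sampling noise.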


20.6.2  Proportional  Hazards  Assumption 

Even though assessment of fit of the regression part of the Cox PH model
corresponds with other regression models such as the logistic model, the Cox
model has its own distributional assumption in need of validation. Here, of
course, the distributional assumption is not as stringent as with other survival
models, but we do need to validate how the survival or hazard functions
for various subjects are connected. There are many graphical and
analytical methods of verifying the PH assumption. Two of the methods have
already been discussed: a graphical examination of parallelism of log $\Lambda$ plots,
and a comparison of stratified with unstratified models (as in Figure 20.1).
Muenz [467] suggested a simple modification that will make nonproportional
hazards more apparent: plot $\Lambda_{\text{KM}_1}(t) / \Lambda_{\text{KM}_2}(t)$ against $t$ and check for
flatness. The points on this curve can be passed through a smoother. One can also
plot differences in $\log(-\log S(t))$ against $t$. Arjas [29] developed a graphical
method based on plotting the estimated cumulative hazard versus the
cumulative number of events in a stratum as $t$ progresses.

There are other methods for assessing whether PH holds that may be more
direct. Gore et al. [226], Harrell and Lee [266], and Kay [340] (see also Anderson and
Senthilselvan) describe a method for allowing the log hazard ratio (Cox
regression coefficient) for a predictor to be a function of time by fitting
specially stratified Cox models. Their method assumes that the predictor being
examined for PH already satisfies the linear regression assumption.
Follow-up time is stratified into intervals and a separate model is fitted to compute
the regression coefficient within each interval, assuming that the effect of the
predictor is constant only within that small interval. It is recommended that
intervals be constructed so that there is roughly an equal number of events
in each. The number of intervals should allow at least 10 or 20 events per
interval.

The interval-specific log hazard ratio is estimated by excluding all subjects
with event/censoring time before the start of the interval and censoring all
events that occur after the end of the interval. This process is repeated for
all desired time intervals. By plotting the log hazard ratio and its confidence
limits versus the interval, one can assess the importance of a predictor as
a function of follow-up time and learn how to model non-PH using more
complicated models containing predictor by time interactions. If the hazard
ratio is approximately constant within broad time intervals, the time
stratification method can be used for fitting and testing the predictor $\times$ time
interaction [266, p. 827]; [98].
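
The artificial censoring step described above is a small data manipulation, sketched here in base R. The function `interval_subset` and the toy data are illustrative assumptions; intervals are taken as closed on the left and open on the right, so an event exactly at the end of an interval is treated as belonging to the next one.

```r
# Data for estimating the log hazard ratio within [start, end):
# drop subjects whose event/censoring time precedes the interval, and censor
# at `end` anyone followed to the end of the interval or beyond.
interval_subset <- function(time, event, start, end = Inf) {
  keep <- time >= start
  t_k  <- time[keep]
  e_k  <- event[keep]
  late <- t_k >= end
  e_k[late] <- 0                  # later events become censored observations
  t_k[late] <- end
  data.frame(id = which(keep), time = t_k, event = e_k)
}

time  <- c(100, 209, 220, 234, 250, 300)
event <- c(1,   1,   0,   1,   1,   0)
sub <- interval_subset(time, event, start = 209, end = 234)
sub      # 5 subjects remain; only the death at day 209 counts as an event
```

A Cox model containing the predictor of interest would then be fitted to each such subset, and the resulting coefficients and standard errors tabulated by interval.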

Consider as an example the rat vaginal cancer data used in Figures 18.9,
18.10, and 20.1. Recall that the PH assumption appeared to be satisfied for
the two groups although Figure 18.9 demonstrated some non-Weibullness.
Figure 20.8 contains a $\Lambda$ ratio plot [467].

f <- cph(S ~ strat(group), surv=TRUE)
# For both strata, eval. S(t) at combined set of death times
times <- sort(unique(days[death == 1]))
est <- survest(f, data.frame(group=levels(group)),
               times=times, conf.type="none")$surv
cumhaz <- - log(est)
plot(times, cumhaz[2,] / cumhaz[1,], xlab="Days",
     ylab="Cumulative Hazard Ratio", type="s")
abline(h=1, col=gray(.80))

Fig. 20.8 Estimate of $\Lambda_2/\Lambda_1$ based on $-\log$ of Altschuler-Nelson-Fleming-Harrington
nonparametric survival estimates.

Table 20.4 Interval-specific group effects from rat data by artificial censoring

Time Interval   Observations   Deaths   Log Hazard Ratio   Standard Error
[0, 209)        40             12       -0.47              0.59
[209, 234)      27             12       -0.72              0.58
234+            14             12       -0.50              0.64

hazard.ratio.plot(g$x, g$y, e=12, pr=TRUE)

The  number  of  observations  is  declining  over  time  because  computations  in 
each  interval  were  based  on  animals  followed  at  least  to  the  start  of  that 
interval. The overall Cox regression coefficient was -0.57 with a standard
error  of  0.35.  There  does  not  appear  to  be  any  trend  in  the  hazard  ratio  over 
time,  indicating  a  constant  hazard  ratio  or  proportional  hazards  (Table  20.4). 

Now consider the Veterans Administration Lung Cancer dataset [331, pp.
60, 223-4]. Log $\Lambda$ plots indicated that the four cell types did not satisfy
PH. To simplify the problem, omit patients with “large” cell type and let
the binary predictor be 1 if the cell type is “squamous” and 0 if it is “small”
or “adeno.” We are assessing whether survival patterns for the two groups
“squamous” versus “small” or “adeno” have PH. Interval-specific estimates of
the squamous : small, adeno log hazard ratios (using Efron’s likelihood) are
found in Table 20.5. Times are in days.

Table 20.5 Interval-specific effects of squamous cell cancer in VA lung cancer data

Time Interval   Observations   Deaths   Log Hazard Ratio   Standard Error
[0, 21)         110            26       -0.46              0.47
[21, 52)        84             26       -0.90              0.50
[52, 118)       59             26       -1.35              0.50
118+            28             26       -1.04              0.45

Table 20.6 Interval-specific effects of performance status in VA lung cancer data

Time Interval   Observations   Deaths   Log Hazard Ratio   Standard Error
[0, 19)         137            27       -0.053             0.010
[19, 49)        112            26       -0.047             0.009
[49, 99)        85             27       -0.036             0.012
99+             28             26       -0.012             0.014

getHdata(valung)
with(valung, {
  hazard.ratio.plot(1 * (cell == 'Squamous'), Surv(t, dead),
                    e=25, subset=cell != 'Large',
                    pr=TRUE, pl=FALSE)
  hazard.ratio.plot(1 * kps, Surv(t, dead), e=25,
                    pr=TRUE, pl=FALSE) })

There  is  evidence  of  a  trend  of  a  decreasing  hazard  ratio  over  time  which 
is  consistent  with  the  observation  that  squamous  cell  patients  had  equal  or 
worse  survival  in  the  early  period  but  decidedly  better  survival  in  the  late 
phase. 

From  the  same  dataset  now  examine  the  PH  assumption  for  Karnofsky 
performance  status  using  data  from  all  subjects,  if  the  linearity  assumption  is 
satisfied.  Interval-specific  regression  coefficients  for  this  predictor  are  given  in 
Table  20.6.  There  is  good  evidence  that  the  importance  of  performance  status 
is  decreasing  over  time  and  that  it  is  not  a  prognostic  factor  after  roughly 
99  days.  In  other  words,  once  a  patient  survives  99  days,  the  performance 
status  does  not  contain  much  information  concerning  whether  the  patient  will 
survive  120  days.  This  non-PH  would  be  more  difficult  to  detect  from  Kaplan- 
Meier  plots  stratified  on  performance  status  unless  performance  status  was 
stratified  carefully. 

Figure 20.9 displays a log hazard ratio plot for a larger dataset in which
more time strata can be formed. In 3299 patients with coronary artery disease,
827 suffered cardiovascular death or nonfatal myocardial infarction. Time
was stratified into intervals containing approximately 30 events, and within
each interval the Cox regression coefficient for an index of anginal pain and
ischemia was estimated. The pain/ischemia index, one component of which is
unstable angina, is seen to have a strong effect for only six months. After that,
survivors have stabilized and knowledge of the angina status in the previous
six months is not informative.

Fig. 20.9 Stratified hazard ratios for pain/ischemia index over time (predictor:
pain/ischemia index; event: cdeathmi). Data from the Duke Cardiovascular
Disease Databank.

Another  method  for  graphically  assessing  the  log  hazard  ratio  over  time  is 
based on Schoenfeld’s partial residuals [503, 557] with respect to each predictor in
the  fitted  model.  The  residual  is  the  contribution  of  the  first  derivative  of  the 
log  likelihood  function  with  respect  to  the  predictor’s  regression  coefficient, 
computed  separately  at  each  risk  set  or  unique  failure  time.  In  Figure  20.10 
the  “loess-smoothed”96  (with  approximate  0.95  confidence  bars)  and  “super- 
smoothed”20^  relationship  between  the  residual  and  unique  failure  time  is 
shown for the same data as Figure 20.9. For smaller n, the raw residuals
should also be displayed to convey the proper sense of variability. The
agreement with the pattern in Figure 20.9 is evident.

Pettitt and Bin Daud [503] suggest scaling the partial residuals by the
information matrix components. They also propose a score test for PH based on
the Schoenfeld residuals. Grambsch and Therneau found that the
Pettitt-Bin Daud standardization is sometimes misleading in that non-PH in one
variable may cause the residual plot for another variable to display non-PH.
The Grambsch-Therneau weighted residual solves this problem and also
yields a residual that is on the same scale as the log relative hazard ratio.
Their residual is

$$\hat{\beta} + d R \hat{V}, \tag{20.23}$$

where $d$ is the total number of events, $R$ is the $n \times p$ matrix of Schoenfeld
residuals, and $\hat{V}$ is the estimated covariance matrix for $\hat{\beta}$. This new residual
can also be the basis for tests for PH, by correlating a user-specified function
of unique failure times with the weighted residuals.

Fig. 20.10 Smoothed weighted [233] Schoenfeld [557] residuals for the same data in
Figure 20.9. Test for PH based on the correlation ($\rho$) between the individual weighted
Schoenfeld residuals and the rank of failure time yielded $\rho = -0.23$, $z = -6.73$,
$P = 2 \times 10^{-11}$.

The  residual  plot  is  computationally  very  attractive  since  the  score  residual 
components  are  byproducts  of  Cox  maximum  likelihood  estimation.  Another 
attractive feature is the lack of need to categorize the time axis. Unless
approximate confidence intervals are derived from smoothing techniques, a lack
of confidence intervals from most software is one disadvantage of the method.

Formal tests for PH can be based on time-stratified Cox regression
estimates [27, 266]. Alternatively, more complex (and probably more efficient) formal
tests for PH can be derived by specifying a form for the time by predictor
interaction (using what is called a time-dependent covariable in the Cox model)
and testing coefficients of such interactions for significance. The obsolete
Version 5 SAS phglm procedure used a computationally fast procedure based on
an approximate score statistic that tests for linear correlation between the
rank order of the failure times in the sample and Schoenfeld’s partial
residuals [258, 266]. This test is available in R (for both weighted and unweighted
residuals) using Therneau’s cox.zph function in the survival package. For the
results in Figure 20.10, the test for PH is highly significant (correlation
coefficient = -0.23, normal deviate z = -6.73). Since there is only one regression
parameter, the weighted residuals are a constant multiple of the unweighted
ones, and have the same correlation coefficient.
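A minimal sketch of this kind of test, assuming the survival package and its bundled veteran data rather than the Duke data behind Figure 20.10, is shown below. The choice of the Karnofsky score as the single predictor is only for illustration. The hand computation uses the x (transformed time) and y (scaled residual) components stored in the cox.zph result.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ karno, data = veteran)

# Test of PH from the survival package, using the rank of failure time
zp <- cox.zph(fit, transform = "rank")
print(zp)

# The same idea by hand: correlate the scaled Schoenfeld residuals
# with the rank-transformed failure times kept in the cox.zph object
resid_scaled <- as.matrix(zp$y)[, 1]
rho <- cor(zp$x, resid_scaled)
rho
```

With one regression parameter the scaled and unscaled residuals differ only by a constant multiple, so rho is also the correlation for the unweighted residuals.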
Table 20.7 Time-specific hazard ratio estimates of squamous cell cancer effect in VA
lung cancer data, by fitting two Weibull distributions with unequal shape parameters

t       log Hazard Ratio
10      -0.36
36      -0.64
83.5    -0.83
200     -1.02



Another method for checking the PH assumption which is especially
applicable to a polytomous predictor involves taking ratios of parametrically
estimated hazard functions estimated separately for each level of the
predictor. For example, suppose that a risk factor X is either present (X = 1) or
absent (X = 0), and suppose that separate Weibull distributions adequately
fit the survival pattern of each group. If there are no other predictors to
adjust for, define the hazard function for X = 0 as $\alpha\gamma t^{\gamma-1}$ and the hazard for
X = 1 as $\delta\theta t^{\theta-1}$. The X = 1 : X = 0 hazard ratio is

$$\frac{\delta\theta t^{\theta-1}}{\alpha\gamma t^{\gamma-1}} = \frac{\delta\theta}{\alpha\gamma}\, t^{\theta-\gamma}. \tag{20.24}$$

The hazard ratio is constant if the two Weibull shape parameters ($\gamma$ and $\theta$)
are equal. These Weibull parameters can be estimated separately and a Wald
test statistic of $H_0: \gamma = \theta$ can be computed by dividing the square of their
difference by the sum of the squares of their estimated standard errors, or
better by a likelihood ratio test. A plot of the estimate of the hazard ratio
above as a function of t may also be informative.

In the VA lung cancer data, the MLE of the Weibull shape parameter
for squamous cell cancer is 0.77 and for the combined small + adeno is 0.99.
Estimates of the reciprocals of these parameters, provided by some software
packages, are 1.293 and 1.012 with respective standard errors of 0.183 and
0.0912. A Wald test for differences in these reciprocals provides a rough test
for a difference in the shape estimates. The Wald $\chi^2$ is 1.89 with 1 d.f.,
indicating slight evidence for non-PH.
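The Wald statistic quoted above can be checked with a few lines of R. This is a worked calculation from the reciprocal shape estimates and standard errors given in the text. The P-value line is an addition that assumes the usual chi-square reference distribution with 1 d.f.

```r
# Reciprocal shape estimates and standard errors quoted in the text
inv_shape <- c(squamous = 1.293, small_adeno = 1.012)
se        <- c(squamous = 0.183, small_adeno = 0.0912)

# Wald chi-square: squared difference over the sum of squared standard errors
wald <- unname(diff(inv_shape)^2 / sum(se^2))
p_value <- pchisq(wald, df = 1, lower.tail = FALSE)
round(c(chisq = wald, p = p_value), 3)
```

The statistic rounds to 1.89, in agreement with the value reported above.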

The fitted Weibull hazard function for squamous cell cancer is $0.0167 t^{-0.23}$
and for adeno + small is $0.0144 t^{-0.01}$. The estimated hazard ratio is then
$1.16 t^{-0.22}$ and the log hazard ratio is $0.148 - 0.22 \log t$. By evaluating this
Weibull log hazard ratio at interval midpoints (arbitrarily using t = 200
for the last (open) interval) we obtain log hazard ratios that are in good
agreement with those obtained by time-stratifying the Cox model (Table 20.5)
as shown in Table 20.7.
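The entries of Table 20.7 follow directly from the fitted log hazard ratio function. As a small check, the sketch below evaluates 0.148 - 0.22 log t (natural logarithm) at the four time points used in that table. The variable names are illustrative.

```r
t_eval <- c(10, 36, 83.5, 200)            # time points used in Table 20.7
log_hr <- 0.148 - 0.22 * log(t_eval)      # Weibull-based log hazard ratio
round(data.frame(t = t_eval, log_hr = log_hr, hr = exp(log_hr)), 2)
```

The rounded log hazard ratios are -0.36, -0.64, -0.83, and -1.02.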

There are many methods of assessing PH using time-dependent
covariables in the Cox model [26, 583]. Gray [237, 238] mentions a flexible and efficient
method of estimating the hazard ratio function using time-dependent
covariables that are X × spline term interactions. Gray’s method uses B-splines and
requires one to maximize a penalized log-likelihood function. Verweij and van
Houwelingen 11 developed a more nonparametric version of this approach.
Hess [281] uses simple restricted cubic splines to model the time-dependent
covariable effects (see also [4, 287, 398, 498]). Suppose that k = 4 knots are used
and that a covariable X is already transformed correctly. The model is

$$\log \lambda(t|X) = \log \lambda(t) + \beta_1 X + \beta_2 X t + \beta_3 X t' + \beta_4 X t'', \tag{20.25}$$

where $t'$, $t''$ are constructed spline variables (Equation 2.25). The X + 1 : X
log hazard ratio function is estimated by

$$\hat{\beta}_1 + \hat{\beta}_2 t + \hat{\beta}_3 t' + \hat{\beta}_4 t''. \tag{20.26}$$

This method can be generalized to allow for simultaneous estimation of the
shape of the X effect and X × t interaction using spline surfaces in (X, t)
instead of $(X_1, X_2)$ (Section 2.7.2).

Table 20.8 summarizes many facets of verifying assumptions for PH
models. The trade-offs of the various methods for assessing proportional hazards
are given in Table 20.9.


20.7 What to Do When PH Fails

When  a  factor  violates  the  PH  assumption  and  a  test  of  association  is  not 
needed,  the  factor  can  be  adjusted  for  through  stratification  as  mentioned 
earlier.  This  is  especially  attractive  if  the  factor  is  categorical.  For  continuous 
predictors,  one  may  want  to  stratify  into  quantile  groups.  The  continuous 
version  of  the  predictor  can  still  be  adjusted  for  as  a  covariable  to  account 
for  any  residual  linearity  within  strata. 

When  a  test  of  significance  is  needed  and  the  P-value  is  impressive,  the 
“principle  of  conservatism”  could  be  invoked,  as  the  P-value  would  likely 
have  been  more  impressive  had  the  factor  been  modeled  correctly.  Predicted 
survival  probabilities  using  this  approach  will  be  erroneous  in  certain  time 
intervals. 

An efficient test of association can be done using time-dependent
covariables [444, pp. 208-217]. For example, in the model

$$\lambda(t|X) = \lambda_0(t) \exp(\beta_1 X + \beta_2 X \times \log(t + 1)) \tag{20.27}$$

one tests $H_0: \beta_1 = \beta_2 = 0$ with 2 d.f. This is similar to the approach used
by [72]. Stratification on time intervals can also be used [27, 226, 266]:

$$\lambda(t|X) = \lambda_0(t) \exp(\beta_1 X + \beta_2 X \times [t > c]). \tag{20.28}$$
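A model of the form of Equation 20.27 can be fitted with the time-transform (tt) facility of coxph in the survival package. The sketch below is an illustration under assumptions: it uses the bundled veteran data with the Karnofsky score as X, and it uses a Wald statistic for the 2 d.f. test, which is one of several possible choices (a likelihood ratio test would also serve).

```r
library(survival)

# Time-dependent covariable X * log(t + 1), with X = Karnofsky score (illustrative)
fit <- coxph(Surv(time, status) ~ karno + tt(karno), data = veteran,
             tt = function(x, t, ...) x * log(t + 1))

b <- coef(fit)                                  # (beta1, beta2)
V <- vcov(fit)
wald <- as.numeric(t(b) %*% solve(V) %*% b)     # 2 d.f. test of beta1 = beta2 = 0
p_value <- pchisq(wald, df = 2, lower.tail = FALSE)
print(fit)
```

Replacing the tt function by function(x, t, ...) x * (t > c0), for a chosen cutpoint c0, gives a step-function interaction of the kind shown in Equation 20.28.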
Table 20.8 Assumptions of the Proportional Hazards Model

Variables: Response Variable T (Time Until Event)
Assumptions: Shape of $\lambda(t|X)$ for fixed X as t ↑
  Cox: none
  Weibull: $t^{\theta}$
Verification: Shape of $S_{KM}(t)$

Variables: Interaction Between X and T
Assumptions: Proportional hazards: effect of X does not depend on T
  (e.g., treatment effect is constant over time)
Verification:
  • Categorical X: check parallelism of stratified log[-log S(t)] plots as t ↑
  • Muenz [467] cum. hazard ratio plots
  • Arjas [29] cum. hazard plots
  • Check agreement of stratified and modeled estimates
  • Hazard ratio plots
  • Smoothed Schoenfeld residual plots and correlation test (time vs. residual)
  • Test time-dependent covariable such as X × log(t + 1)
  • Ratio of parametrically estimated $\lambda(t)$

Variables: Individual Predictors X
Assumptions: Shape of $\lambda(t|X)$ for fixed t as X ↑
  Linear: $\log \lambda(t|X) = \log \lambda(t) + \beta X$
  Nonlinear: $\log \lambda(t|X) = \log \lambda(t) + f(X)$
Verification:
  • k-level ordinal X: linear term + k - 2 dummy variables
  • Continuous X: polynomials, spline functions, smoothed martingale residual plots

Variables: Interaction Between $X_1$ and $X_2$
Assumptions: Additive effects: effect of $X_1$ on $\log \lambda$ is independent of $X_2$
  and vice versa
Verification: Test nonadditive terms (e.g., products)


If  this  step-function  model  holds,  and  if  a  sufficient  number  of  subjects  have 
late  follow-up,  you  can  also  fit  a  model  for  early  outcomes  and  a  separate 
one  for  late  outcomes  using  interval-specific  censoring  as  discussed  in  Section 
20.6.2.  The  dual  model  approach  provides  easy  to  interpret  models,  assuming 
that  proportional  hazards  is  satisfied  within  each  interval. 

Kronborg  and  Aaby36/  and  Dabrowska  et  al.143  provide  tests  for  differences 
in $\Lambda(t)$ at specific t based on stratified PH models. These can also be used
to test for treatment effects when PH is violated for treatment but not for
adjustment variables. Differences in mean restricted life length (differences in
areas under survival curves up to a fixed finite time) can also be useful for
comparing therapies when PH fails [335].


Table 20.9 Comparison of methods for checking the proportional hazards
assumption and for allowing for non-proportional hazards

Method                                  Requires    Requires    Computational  Yields       Yields Estimate of  Requires          Must Choose
                                        Grouping X  Grouping t  Efficiency     Formal Test  λ2(t)/λ1(t)         Fitting 2 Models  Smoothing Parameter
log[-log], Muenz, Arjas plots           x                       x                           x
Dabrowska log Λ difference plots        x                       x              x            x
Stratified vs. Modeled Estimates        x                       x                                               x
Hazard ratio plot                                   x                          ?            x                   x                 ?
Schoenfeld residual plot                                        x                           x                                     x
Schoenfeld residual correlation test                            x              x
Fit time-dependent covariables                                                 x            x
Ratio of parametric estimates of λ(t)   x                       x              x            x                   x

Parametric models that assume an effect other than PH, for example, the
log-logistic model [226], can be used to allow a predictor to have a constantly
increasing or decreasing effect over time. If one predictor satisfies PH but
another does not, this approach will not work.

See Section 4.6 for the general approach using variance inflation factors.


20.9 Overly Influential Observations

Therneau et al. [605] describe the use of score residuals for assessing influence in
Cox and related regression models. They show that the infinitesimal jackknife
estimate of the influence of observation i on $\beta$ equals $Vs'$, where V is the
estimated variance-covariance matrix of the p regression estimates b and
$s = (s_{i1}, s_{i2}, \ldots, s_{ip})$ is the vector of score residuals for the p regression coefficients
for the ith observation. Let $S_{n \times p}$ denote the matrix of score residuals over
all observations. Then an approximation to the unstandardized change in b
(DFBETA) is SV. Standardizing by the standard errors of b found from the
diagonals of V, $e = (V_{11}, V_{22}, \ldots, V_{pp})^{1/2}$, yields

$$\text{DFBETAS} = S V\, \text{Diag}(e)^{-1}, \tag{20.29}$$

where Diag(e) is a diagonal matrix containing the estimated standard errors.

As discussed in Section 20.13, identification of overly influential
observations is facilitated by printing, for each predictor, the list of observations
containing |DFBETAS| > u for any parameter associated with that predictor.
The choice of cutoff u depends on the sample size among other things. A
typical choice might be u = 0.2 indicating a change in a regression coefficient
of 0.2 standard errors.
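Equation 20.29 can be reproduced directly from the score residuals of a fitted Cox model. The sketch below is illustrative only: it assumes the survival package and its bundled lung data, with age and sex as arbitrary example predictors, and compares the hand computation with the dfbetas residuals that the package itself returns.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

S <- residuals(fit, type = "score")   # n x p matrix of score residuals
V <- vcov(fit)                        # estimated variance-covariance matrix
e <- sqrt(diag(V))                    # standard errors of the coefficients

dfbeta  <- S %*% V                    # approximate unstandardized change in b
dfbetas <- dfbeta %*% diag(1 / e)     # Equation 20.29

u <- 0.2                              # cutoff discussed in the text
flagged <- which(apply(abs(dfbetas) > u, 1, any))
flagged
```

Observations listed in flagged are those whose removal is estimated to shift at least one coefficient by more than 0.2 standard errors.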


20.10  Quantifying  Predictive  Ability 

To obtain a unitless measure of predictive ability for a Cox PH model we
can use the R index described in Section 9.8.3, which is the square root of
the fraction of log likelihood explained by the model of the log likelihood
that could be explained by a perfect model, penalized for the complexity of
the model. The lowest (best) possible -2 log likelihood for the Cox model is
zero, which occurs when the predictors can perfectly rank order the survival
times. Therefore, as was the case with the logistic model, the quantity $L^*$
from Section 9.8.3 is zero and an R index that is penalized for the number of
parameters in the model is given by

$$R^2 = (\text{LR} - 2p)/L^0, \tag{20.30}$$

where p is the number of parameters estimated and $L^0$ is the -2 log likelihood
when $\beta$ is restricted to be zero (i.e., there are no predictors in the model). R
will be near one for a perfectly predictive model and near zero for a model
that does not discriminate between short and long survival times. The R
index does not take into account any stratification factors. If stratification
factors are present, R will be near one if survival times can be perfectly ranked
within strata even though there is overlap between strata.

Schemper [546] and Korn and Simon have reported that $R^2$ is too
sensitive to the distribution of censoring times and have suggested
alternatives based on the distance between estimated Cox survival probabilities
(using predictors) and Kaplan-Meier estimates (ignoring predictors). Kent
and O’Quigley [345] also report problems with $R^2$ and suggest a more complex
measure. Schemper54 investigated the Maddala-Magee431,43 index $R^2_{LR}$
described in Section 9.8.3, applied to Cox regression:

$$R^2_{LR} = 1 - \exp(-\text{LR}/n) = 1 - \omega^{2/n}, \tag{20.31}$$

where $\omega$ is the null model likelihood divided by the fitted model likelihood.

For many situations, $R^2_{LR}$ performed as well as Schemper’s more complex
measure [546, 549] and hence it is preferred because of its ease of calculation
(assuming that PH holds). Ironically, Schemper [548] demonstrated that the n
in the formula for this index is the total number of observations, not the
number of events (but see O’Quigley, Xu, and Stare [481]). To make the $R^2$
index have a maximum value of 1.0, we use the Nagelkerke4'1 $R^2_N$ discussed
in Section 9.8.3.
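As a small worked example of these likelihood-based indexes, the sketch below uses the summary values reported for the random-predictor example of Figure 20.11 (n = 200 observations, 20 parameters, model likelihood ratio chi-square of 19). The null-model -2 log likelihood is not reported there, so only the numerator of the penalized index is computed. The variable names are illustrative.

```r
n  <- 200     # total number of observations
LR <- 19      # model likelihood ratio chi-square
p  <- 20      # number of parameters estimated

r2_lr <- 1 - exp(-LR / n)             # likelihood ratio R^2 index
omega <- exp(-LR / 2)                 # null likelihood / fitted likelihood
r2_omega <- 1 - omega^(2 / n)         # same index written in terms of omega

penalized_numerator <- LR - 2 * p     # numerator of the penalized R^2
round(c(r2_lr = r2_lr, r2_omega = r2_omega), 4)
```

The two forms agree (about 0.09), and the negative penalized numerator (-21) reflects that 20 parameters were spent to gain a chi-square of only 19.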

An easily interpretable index of discrimination for survival models is
derived from Kendall’s $\tau$ and Somers’ $D_{xy}$ rank correlation [579], the
Gehan-Wilcoxon statistic for comparing two samples for survival differences, and
the Brown-Hollander-Korwar nonparametric test of association for censored
data [76, 170, 262, 268]. This index, c, is a generalization of the area under the ROC
curve discussed under the logistic model, in that it applies to a continuous
response variable that can be censored. The c index is the proportion of all
pairs of subjects whose survival time can be ordered such that the subject
with the higher predicted survival is the one who survived longer. Two
subjects’ survival times cannot be ordered if both subjects are censored or if one
has failed and the follow-up time of the other is less than the failure time
of the first. The c index is a probability of concordance between predicted
and observed survival, with c = 0.5 for random predictions and c = 1 for a
perfectly discriminating model. The c index is mildly affected by the amount
of censoring. $D_{xy}$ is obtained from 2(c - 0.5). While c (and $D_{xy}$) is a good
measure of pure discrimination ability of a single model, it is not sensitive
enough to allow multiple models to be compared [447].
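The pairwise definition of c can be made concrete with a brute-force sketch. The function c_index below is a hypothetical helper written for illustration, not the implementation used by any package. It assumes that pairs with tied follow-up times are skipped and that tied predictions count as one half, both of which are simplifying conventions adopted here.

```r
# Hypothetical helper: concordance (c) index for right-censored data
c_index <- function(pred_surv, time, event) {
  n <- length(time)
  concordant <- 0
  usable <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      a <- if (time[i] <= time[j]) i else j    # subject with the shorter follow-up
      b <- if (a == i) j else i
      # A pair can be ordered only if the shorter time is an observed failure
      if (time[a] < time[b] && event[a] == 1) {
        usable <- usable + 1
        if (pred_surv[a] < pred_surv[b]) {
          concordant <- concordant + 1
        } else if (pred_surv[a] == pred_surv[b]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / usable
}

time  <- c(2, 4, 6, 8)
event <- c(1, 0, 1, 1)                 # 1 = failed, 0 = censored
pred  <- c(0.2, 0.5, 0.95, 0.9)        # predicted survival probabilities
cidx <- c_index(pred, time, event)
dxy  <- 2 * (cidx - 0.5)
c(c = cidx, Dxy = dxy)
```

In this toy dataset four pairs can be ordered and three of them are concordant, giving c = 0.75 and Dxy = 0.5.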

Since high hazard means short survival time, when the linear predictor
$X\hat{\beta}$ from a Cox model is compared with observed survival time, $D_{xy}$ will be
negative. Some analysts may want to negate reported values of $D_{xy}$.


20.11 Validating the Fitted Model

Separate bootstrap or cross-validation assessments can be made for
calibration and discrimination of Cox model survival and log relative hazard
estimates.


20.11.1 Validation of Model Calibration

One approach to validation of the calibration of predictions is to obtain
unbiased estimates of the difference between Cox predicted and Kaplan-Meier
survival estimates at a fixed time u. Here is one sequence of steps.

1. Obtain cutpoints (e.g., deciles) of predicted survival at time u so as to
have a given number of subjects (e.g., 50) in each interval of predicted
survival. These cutpoints are based on the distribution of $\hat{S}(u|X)$ in the
whole sample for the “final” model (for data-splitting, instead use the model
developed in the training sample). Let k denote the number of intervals
used.

2. Compute the average $\hat{S}(u|X)$ in each interval.

3. Compare this with the Kaplan-Meier survival estimates at time u,
stratified by intervals of $\hat{S}(u|X)$. Let the differences be denoted by
$d = (d_1, \ldots, d_k)$.

4.  Use  bootstrapping  or  cross-validation  to  estimate  the  overoptimism  in  d 
and  then  to  correct  d  to  get  a  more  fair  assessment  of  these  differences. 
For  each  repetition,  repeat  any  stepwise  variable  selection  or  stagewise 
significance  testing  using  the  same  stopping  rules  as  were  used  to  derive 
the  “final”  model.  No  more  than  B  =  200  replications  are  needed  to  obtain 
accurate  estimates. 

5.  If  desired,  the  bias-corrected  d  can  be  added  to  the  original  stratified 
Kaplan-Meier  estimates  to  obtain  a  bias-corrected  calibration  curve. 
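Steps 1 to 3 above (the apparent, uncorrected differences only) can be sketched as follows. This is an illustration under assumptions: it uses the survival package and its bundled lung data with two arbitrary predictors, a one-year horizon, and k = 5 equal-size groups. The sign convention for d (mean predicted minus Kaplan-Meier) is chosen here for the example, and the bootstrap correction of steps 4 and 5 is not shown.

```r
library(survival)

d <- lung[, c("time", "status", "age", "sex")]
d$event <- as.integer(d$status == 2)    # lung codes 1 = censored, 2 = dead
u <- 365                                # fixed time (days) for the comparison
k <- 5                                  # number of intervals of predicted survival

fit <- coxph(Surv(time, event) ~ age + sex, data = d)

# Predicted survival at time u for every subject
s_hat <- as.numeric(summary(survfit(fit, newdata = d), times = u)$surv)

# Step 1: equal-size groups ordered by predicted survival
grp <- ceiling(k * rank(s_hat, ties.method = "first") / length(s_hat))

# Step 2: mean predicted survival in each group
mean_pred <- as.numeric(tapply(s_hat, grp, mean))

# Step 3: Kaplan-Meier estimate at time u within each group
km <- sapply(split(d, grp), function(g)
  summary(survfit(Surv(time, event) ~ 1, data = g),
          times = u, extend = TRUE)$surv)

d_diff <- mean_pred - as.numeric(km)    # apparent differences d_1, ..., d_k
round(cbind(mean_pred, km = as.numeric(km), d_diff), 3)
```

Repeating the same computation in bootstrap resamples, and evaluating each resampled model on the original data, would give the optimism needed to correct d.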

However, any statistical method that uses binning of continuous variables
(here, the predicted risk) is arbitrary and has lower precision than smooth
estimates that allow for interpolation. A far better approach to estimating
calibration curves for survival models is to use the flexible adaptive hazard
regression approach of Kooperberg et al. [361] as discussed on p. 450. Their
method does not assume linearity or proportional hazards. Hazard
regression can be used to estimate the relationship between (suitably transformed)
predicted survival probabilities and observed outcomes, i.e., to derive a
calibration curve. The bootstrap is used to de-bias the estimates to correct for
overfitting, allowing estimation of the likely future calibration performance
of the fitted model.

As an example, consider a dataset of 20 random uniformly distributed
predictors for a sample of size 200. Let the failure time be another random
uniform variable that is independent of all the predictors, and censor half of
the failure times at random. Due to fitting 20 predictors to 100 events, there
will apparently be fair agreement between predicted and observed survival
over all strata (smooth black curve from hazard regression in Figure 20.11).
However, the bias-corrected calibration (blue curve from hazard regression)
gives a more truthful answer: examining the ×s across levels of predicted
survival demonstrates that predicted and observed survival are weakly related,
in more agreement with how the data were generated. For the more arbitrary
Kaplan-Meier approach, we divide the observations into quintiles of predicted
0.5-year survival, so that there are 40 observations per stratum.

cal <- calibrate(f, u=.5, B=200)

Using Cox survival estimates at 0.5 Years

plot(cal, ylim=c(.4, 1), subtitles=FALSE)
calkm <- calibrate(f, u=.5, m=40, cmethod='KM', B=200)

Using Cox survival estimates at 0.5 Years

plot(calkm, add=TRUE)    # Figure 20.11

20.11.2  Validation  of  Discrimination  and  Other 
Statistical  Indexes 

Here bootstrapping and cross-validation are used as for logistic models
(Section 10.9). We can obtain bootstrap bias-corrected estimates of c or
equivalently $D_{xy}$. To instead obtain a measure of relative calibration or slope
shrinkage, we can bootstrap the apparent estimate of $\gamma = 1$ in the model

$$\lambda(t|X) = \lambda(t) \exp(\gamma X b). \tag{20.32}$$

Besides being a measure of calibration in itself, the bootstrap estimate of
$\gamma$ also leads to an unreliability index U which measures how far the model
maximum log likelihood (which allows for an overall slope correction) is from
the log likelihood evaluated at “frozen” regression coefficients ($\gamma = 1$) (see [267]
and Section 10.9):

$$U = \frac{\text{LR}(\hat{\gamma} X b) - \text{LR}(X b)}{L^0}, \tag{20.33}$$

where $L^0$ is the -2 log likelihood for the null model (Section 9.8.3). Similarly,
a discrimination index D can be derived from the -2 log likelihood at the
shrunken linear predictor, penalized for estimating one parameter ($\gamma$) (see
also [633, p. 1318] and [123]):

$$D = \frac{\text{LR}(\hat{\gamma} X b) - 1}{L^0}. \tag{20.34}$$

D is the same as $R^2$ discussed above when p = 1 (indicating only one
reestimated parameter, $\gamma$), the penalized proportion of explainable log likelihood
that was explained by the model. Because of the remark of Schemper [546], all
of these indexes may unfortunately be functions of the censoring pattern.

An index of overall quality that penalizes discrimination for unreliability is

$$Q = D - U = \frac{\text{LR}(X b) - 1}{L^0}. \tag{20.35}$$

Q is a normalized and penalized -2 log likelihood that is evaluated at the
uncorrected linear predictor.

Fig. 20.11 Calibration of random predictions using Efron’s bootstrap with B = 200
resamples (horizontal axis: predicted 0.5 year survival). Dataset has n = 200, 100
uncensored observations, 20 random predictors, model $\chi^2_{20} = 19$. The smooth
black line is the apparent calibration estimated by adaptive linear spline hazard
regression [361], and the blue line is the bootstrap bias- (overfitting-) corrected
calibration curve estimated also by hazard regression. The gray scale line is the line
of identity representing perfect calibration. Black dots represent apparent
calibration accuracy obtained by stratifying into intervals of predicted 0.5y survival
containing 40 events per interval and plotting the mean predicted value within the
interval against the stratum’s Kaplan-Meier estimate. The blue ×s represent
bootstrap bias-corrected Kaplan-Meier estimates.

For  the  random  predictions  used  in  Figure  20.11,  the  bootstrap  estimates 
with  B  =  200  resamples  are  found  in  Table  20.10. 

latex(validate(f, B=200), digits=3, file='', caption='',
      table.env=TRUE, label='tab:cox-val-random')


Table 20.10 Bootstrap validation of a Cox model with random predictors

Index   Original   Training   Test      Optimism   Corrected   n
        Sample     Sample     Sample               Index
Dxy      0.213      0.335      0.147     0.188      0.025      200
R2       0.092      0.191      0.042     0.150     -0.058      200
Slope    1.000      1.000      0.389     0.611      0.389      200
D        0.021      0.048      0.009     0.039     -0.019      200
U       -0.002     -0.002      0.028    -0.031      0.028      200
Q        0.023      0.050     -0.020     0.070     -0.047      200
g        0.516      0.878      0.339     0.539     -0.023      200

It can be seen that the apparent correlation ($D_{xy} = -0.21$) does not hold
up after correcting for overfitting ($D_{xy} = -0.02$). Also, the slope shrinkage
(0.39) indicates extreme overfitting.
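
The optimism and corrected-index columns of Table 20.10 follow from simple arithmetic: optimism is the training-sample value minus the test-sample value, and the corrected index is the original value minus the optimism. The base R sketch below re-derives both columns from the rounded values printed in the table, so agreement can only be expected to within rounding. The object names are illustrative and are not part of rms.

```r
# Rounded values transcribed from Table 20.10
val <- data.frame(
  index     = c("Dxy", "R2", "Slope", "D", "U", "Q", "g"),
  original  = c(0.213,  0.092, 1.000,  0.021, -0.002,  0.023,  0.516),
  training  = c(0.335,  0.191, 1.000,  0.048, -0.002,  0.050,  0.878),
  test      = c(0.147,  0.042, 0.389,  0.009,  0.028, -0.020,  0.339),
  optimism  = c(0.188,  0.150, 0.611,  0.039, -0.031,  0.070,  0.539),
  corrected = c(0.025, -0.058, 0.389, -0.019,  0.028, -0.047, -0.023)
)

# Optimism = average training-sample index minus average test-sample index
opt  <- val$training - val$test
# Corrected index = original (apparent) index minus the printed optimism
corr <- val$original - val$optimism

# Largest disagreement with the printed columns (rounding error only)
data.frame(index = val$index,
           optimism.diff  = round(opt  - val$optimism, 3),
           corrected.diff = round(corr - val$corrected, 3))
```

The same bookkeeping applies to any index reported by a bootstrap validation of this kind; only the definition of the index changes.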

See [633, Section 6] and [640] and Section 18.3.7 for still more useful
methods for validating the Cox model.


20.12  Describing  the  Fitted  Model 

As with logistic modeling, once a Cox PH model has been fitted and all
its assumptions verified, the final model needs to be presented and
interpreted. The fastest way to describe the model is to interpret each effect in
it. For each predictor the change in log hazard per desired units of change
in the predictor value may be computed, or the antilog of this quantity,
$\exp(\hat{\beta}_j \times \text{change in } X_j)$, may be used to estimate the hazard ratio holding
all other factors constant. When $X_j$ is a nonlinear factor, changes in predicted
$X\hat{\beta}$ for sensible values of $X_j$ such as quartiles can be used as described in
Section 10.10. Of course for nonmodeled stratification factors, this method is
of no help. Figure 20.12 depicts a way to display estimated surgical : medical
hazard ratios in the presence of a significant treatment by disease severity
interaction and a secular trend in the benefit of surgical therapy (treatment
by year of diagnosis interaction).
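
As a small numerical illustration of the antilog calculation described above, the base R sketch below converts a coefficient and its standard error into a hazard ratio with Wald 0.95 confidence limits for a chosen change in the predictor. The coefficient, standard error, and 10-unit change are hypothetical values chosen only for illustration; they are not taken from any model in this chapter.

```r
# Hypothetical values, for illustration only
beta_hat <- 0.05   # estimated change in log hazard per 1-unit increase in X_j
se_beta  <- 0.01   # standard error of beta_hat
delta    <- 10     # change in X_j of interest (e.g., 10 years of age)

# Change in log hazard for the chosen change in X_j, and its antilog
log_hr <- beta_hat * delta
hr     <- exp(log_hr)

# Wald 0.95 limits are formed on the log hazard scale, then exponentiated
ci <- exp(log_hr + c(-1, 1) * qnorm(0.975) * se_beta * delta)

round(c(HR = hr, lower = ci[1], upper = ci[2]), 2)
```

Forming the limits on the log scale keeps them positive and makes them symmetric around the estimate on a log axis, which is how hazard ratios are usually displayed.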

[Figure 20.12: panels for 1970, 1977, and 1984; rows for 1-, 2-, and 3-vessel
disease and for 75% and 95% left main disease; x-axis: hazard ratio, 0.125 to 2.5.]

Fig. 20.12  A display of an interaction between treatment and extent of disease, and
between treatment and calendar year of start of treatment. Comparison of medical and
surgical average hazard ratios for patients treated in 1970, 1977, and 1984 according
to coronary disease severity. Circles represent point estimates; bars represent 0.95
confidence limits of hazard ratios. Ratios less than 1.0 indicate that coronary bypass
surgery is more effective [88].

Often, the use of predicted survival probabilities may make the model
more interpretable. If the effect of only one factor is being displayed and
that factor is polytomous or predictions are made for specific levels, survival
curves (with or without adjustment for other factors not shown) can be drawn
for each level of the predictor of interest, with follow-up time on the x-axis.
Figure 20.2 demonstrated this for a factor which was a stratification factor.
Figure 20.13 extends this by displaying survival estimates stratified by
treatment but adjusted to various levels of two modeled factors, one of which, year
of diagnosis, interacted with treatment.

When a continuous predictor is of interest, it is usually more informative
to display that factor on the x-axis with estimated survival at one or more
time points on the y-axis. When the model contains only one predictor, even
if that predictor is represented by multiple terms such as a spline expansion,
one may simply plot that factor against the predicted survival. Figure 20.14
depicts the relationship between treadmill exercise score, which is a weighted
linear combination of several predictors in a Cox model, and the probability
of surviving five years.

When displaying the effect of a single factor after adjusting for multiple
predictors which are not displayed, care only need be taken for the values
to which the predictors are adjusted (e.g., grand means). When instead the
desire is to display the effect of multiple predictors simultaneously, an
important continuous predictor can be displayed on the x-axis while separate
curves or graphs are made for levels of other factors. Figure 20.15, which
corresponds to the log $\Lambda$ plots in Figure 20.5, displays the joint effects of age
and sex on the three-year survival probability. Age is modeled with a cubic
spline function, and the model includes terms for an age × sex interaction.

p <- Predict(f.ia, age, sex, time=3)
ggplot(p)


[Figure 20.13: estimated survival probability versus years of follow-up (0 to 5),
shown in several panels with separate curves for surgical and medical treatment.]

Fig. 20.13  Cox-Kalbfleisch-Prentice survival estimates stratifying on treatment and
adjusting for several predictors, showing a secular trend in the efficacy of coronary
artery bypass surgery. Estimates are for patients with left main disease and normal
(LVEF=0.6) or impaired (LVEF=0.4) ventricular function [516].


Besides making graphs of survival probabilities estimated for given levels
of the predictors, nomograms have some utility in specifying a fitted Cox
model. A nomogram can be used to compute $X\hat{\beta}$, the estimated log hazard for
a subject with a set of predictor values X relative to the “standard” subject.
The central line in the nomogram will be on this linear scale unlike the logistic
model nomograms given in Section 10.10 which further transformed $X\hat{\beta}$ into
$[1 + \exp(-X\hat{\beta})]^{-1}$. Alternatively, the central line could be on the nonlinear
$\exp(X\hat{\beta})$ hazard ratio scale or survival at fixed t.

Fig. 20.14  Cox model predictions with respect to a continuous variable. X-axis
shows the range of the treadmill score seen in clinical practice and y-axis shows the
corresponding five-year survival probability predicted by the Cox regression model
for the 2842 study patients [440].

Fig. 20.15  Survival estimates for model stratified on sex, with interaction.

A graph of the estimated underlying survival function $\hat{S}(t)$ as a function
of t can be coupled with the nomogram used to compute $X\hat{\beta}$. The survival
for a specific subject, $\hat{S}(t|X)$, is obtained from $\hat{S}(t)^{\exp(X\hat{\beta})}$. Alternatively, one
could graph $\hat{S}(t)^{\exp(X\hat{\beta})}$ for various values of $X\hat{\beta}$ (e.g., $X\hat{\beta} = -2, -1, 0, 1, 2$)
so that the desired survival curve could be read directly, at least to the nearest
tabulated $X\hat{\beta}$. For estimating survival at a fixed time, say two years, one need
only provide the constant $\hat{S}(t)$. The nomogram could even be adapted to
include a nonlinear scale $\hat{S}(2)^{\exp(X\hat{\beta})}$ to allow direct computation of two-year
survival.
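
The relationship between the underlying survival function and subject-specific survival can be sketched in a few lines of base R. The underlying survival values below are hypothetical, and `surv_given_lp` is an illustrative helper rather than an rms function; the point is only that each curve is the underlying curve raised to the power exp(linear predictor).

```r
# Hypothetical underlying survival estimates S0(t) at yearly time points
t  <- 0:5
S0 <- c(1.00, 0.95, 0.90, 0.84, 0.79, 0.75)

# Survival for a subject with linear predictor lp: S(t | X) = S0(t)^exp(lp)
surv_given_lp <- function(S0, lp) S0 ^ exp(lp)

# One curve per tabulated value of the linear predictor
lps    <- c(-2, -1, 0, 1, 2)
curves <- sapply(lps, function(lp) surv_given_lp(S0, lp))
dimnames(curves) <- list(paste0("t=", t), paste0("lp=", lps))
round(curves, 3)

# For survival at a fixed time (two years) only the constant S0(2) is needed
S0_2     <- S0[t == 2]
two_year <- S0_2 ^ exp(lps)
round(two_year, 3)
```

The column for a linear predictor of zero reproduces the underlying curve, and survival at any fixed time falls monotonically as the linear predictor increases.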
Harrell’s cpower, spower, and ciapower (in the Hmisc package) perform power
calculations for Cox tests in follow-up studies. cpower computes power for
a two-sample Cox (log-rank) test with random patient entry over a fixed
duration and a given length of minimum follow-up. The expected number of
events in each group is estimated by assuming exponential survival. cpower
uses a slight modification of the method of Schoenfeld [558] (see [501]). Separate
specification of noncompliance in the active treatment arm and “drop-in” from
the control arm into the active arm are allowed, using the method of Lachin
and Foulkes [370]. The ciapower function computes power of the Cox interaction
test in a 2 × 2 setup using the method of Peterson and George [501]. It does
not take noncompliance into account. The spower function simulates power
for two-sample tests (the log-rank test by default) allowing for very complex
conditions such as continuously varying treatment effect and noncompliance
probabilities.
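
To give a feel for the kind of calculation involved, the base R sketch below implements the widely used normal approximation, usually attributed to Schoenfeld, for the power of a two-sample log-rank (Cox) test given the total number of events. It is not the cpower implementation: it takes the number of events as given rather than deriving it from accrual, follow-up, and exponential survival, and it ignores noncompliance and drop-in. The function names are hypothetical.

```r
# Approximate power of a two-sided log-rank (two-sample Cox) test given the
# total number of events d, the hazard ratio hr, and the proportion p1 of
# subjects allocated to group 1.  The negligible probability of rejecting
# in the wrong direction is ignored.
schoenfeld_power <- function(d, hr, p1 = 0.5, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha / 2)
  pnorm(sqrt(d * p1 * (1 - p1)) * abs(log(hr)) - z_alpha)
}

# Total number of events needed for a target power, same approximation
schoenfeld_events <- function(hr, power = 0.9, p1 = 0.5, alpha = 0.05) {
  (qnorm(1 - alpha / 2) + qnorm(power))^2 / (p1 * (1 - p1) * log(hr)^2)
}

schoenfeld_power(d = 100, hr = 0.5)        # power with 100 events
schoenfeld_events(hr = 0.5, power = 0.9)   # events needed for 0.9 power
```

Because power under this approximation depends on the data only through the number of events, the practical work in a function such as cpower lies in estimating how many events the design will produce.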

The rms package cph function is a slight modification of the coxph
function written by Terry Therneau (in his survival package) to work in the rms
framework. cph computes MLEs of Cox and stratified Cox PH models, overall
score and likelihood ratio $\chi^2$ statistics for the model, martingale residuals, the
linear predictor ($X\hat{\beta}$ centered to have mean 0), and collinearity diagnostics.
Efron, Breslow, and exact partial likelihoods are supported (although the
exact likelihood is very computationally intensive if ties are frequent). The
function also fits the Andersen-Gill generalization of the Cox PH model.
This model allows for predictor values to change over time in the form of step
functions as well as allowing time-dependent stratification (subjects can jump
to different hazard function shapes). The Andersen-Gill formulation allows
multiple events per subject and permits subjects to move in and out of risk at
any desired time points. The latter feature allows time zero to have a more
general definition. (See Section 9.5 for methods of adjusting the
variance-covariance matrix of $\hat{\beta}$ for dependence in the events per subject.) The
printing function corresponding to cph prints the Nagelkerke index $R^2_N$ described
in Section 20.10, and has a latex option for better output. cph works in
conjunction with the generic functions such as specs, predict, summary, anova,
fastbw, which.influence, latex, residuals, coef, nomogram, and Predict
described in Section 20.13, the same as the logistic regression function lrm does.
For the purpose of plotting predicted survival at a single time, Predict has an
additional argument time for plotting cph fits. It also has an argument loglog
which if true causes instead log-log survival to be plotted on the y-axis. cph
has all the arguments described in Section 20.13 and some that are specific
to it.

Similar to functions for psm, there are Survival, Quantile, and Mean functions
which create other R functions to evaluate survival probabilities and perform
other calculations, based on a cph fit with surv=TRUE. These functions,
unlike all the others, allow polygon (linear interpolation) estimation of survival
probabilities, quantiles, and mean survival time as an option. Quantile is the
only automatic way for obtaining survival quantiles with cph. Quantile
estimates will be missing when the survival curve does not extend long enough.
Likewise, survival estimates will be missing for t > maximum follow-up time,
when the last event time is censored. Mean computes the mean survival time
if the last failure time in each stratum is uncensored. Otherwise, Mean may
be used to compute restricted mean lifetime using a user-specified
truncation point [334]. Quantile and Mean are especially useful with plot and nomogram.
Survival is useful with nomogram.
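
The idea of a restricted mean lifetime can be illustrated with base R alone: it is the area under the estimated survival curve from time zero to the truncation point. The sketch below uses a hypothetical step-function survival estimate and a hypothetical helper `restricted_mean`; it is not the code behind Mean and does not cover the polygon (linear interpolation) option.

```r
# Hypothetical step-function survival estimate: S(t) drops at the event times
event_times <- c(1, 2, 4)          # years
surv_after  <- c(0.8, 0.5, 0.2)    # S(t) just after each event time

# Restricted mean lifetime: area under the step function from 0 to tau
restricted_mean <- function(times, surv, tau) {
  keep  <- times < tau
  grid  <- c(0, times[keep], tau)   # end points of the constant pieces
  level <- c(1, surv[keep])         # height of S(t) on each piece
  sum(diff(grid) * level)
}

rm5 <- restricted_mean(event_times, surv_after, tau = 5)
rm3 <- restricted_mean(event_times, surv_after, tau = 3)
c(tau5 = rm5, tau3 = rm3)
```

Here the curve never reaches zero, so the unrestricted mean is undefined; choosing a truncation point is what makes the summary well defined.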

The R program below demonstrates how several cph-related functions work
well with the nomogram function. Here predicted three-year survival
probabilities and median survival time (when defined) are displayed against age and
sex from the previously simulated dataset. The fact that a nonlinear effect
interacts with a stratified factor is taken into account.


surv    <- Survival(f.ia)
surv.f  <- function(lp) surv(3, lp, stratum='sex=Female')
surv.m  <- function(lp) surv(3, lp, stratum='sex=Male')
quant   <- Quantile(f.ia)
med.f   <- function(lp) quant(.5, lp, stratum='sex=Female')
med.m   <- function(lp) quant(.5, lp, stratum='sex=Male')
at.surv <- c(.01, .05, seq(.1, .9, by=.1), .95, .98, .99, .999)
at.med  <- c(0, .5, 1, 1.5, seq(2, 14, by=2))
n <- nomogram(f.ia, fun=list(surv.m, surv.f, med.m, med.f),
              funlabel=c('S(3 | Male)', 'S(3 | Female)',
                         'Median (Male)', 'Median (Female)'),
              fun.at=list(c(.8, .9, .95, .98, .99),
                          c(.1, .3, .5, .7, .8, .9, .95, .98),
                          c(8, 10, 12), c(1, 2, 4, 8, 12)))
plot(n, col.grid=FALSE, lmgp=.2)
latex(f.ia, file='', digits=3)

$$\mathrm{Prob}\{T \geq t \mid \mathrm{sex} = i\} = S_i(t)^{e^{X\beta}}, \quad \text{where}$$

$$\begin{aligned}
X\hat{\beta} = {} & -1.8 \\
& + 0.0493\,\mathrm{age} - 2.15 \times 10^{-6}(\mathrm{age} - 30.3)_+^3 - 2.82 \times 10^{-5}(\mathrm{age} - 45.1)_+^3 \\
& + 5.18 \times 10^{-5}(\mathrm{age} - 54.6)_+^3 - 2.15 \times 10^{-5}(\mathrm{age} - 69.6)_+^3 \\
& + [\mathrm{Female}]\,[-0.0366\,\mathrm{age} + 4.29 \times 10^{-5}(\mathrm{age} - 30.3)_+^3 - 0.00011(\mathrm{age} - 45.1)_+^3 \\
& \qquad + 6.74 \times 10^{-5}(\mathrm{age} - 54.6)_+^3 - 2.32 \times 10^{-7}(\mathrm{age} - 69.6)_+^3]
\end{aligned}$$

and $[c] = 1$ if subject is in group c, 0 otherwise; $(x)_+ = x$ if $x > 0$, 0
otherwise.

Fig. 20.16  Nomogram from a fitted stratified Cox model that allowed for interaction
between age and sex, and nonlinearity in age. The axis for median survival time is
truncated on the left where the median is beyond the last follow-up time.


rcspline.plot(lvef, d.time, event=cdeath, nk=3)

The  corresponding  smoothed  martingale  residual  plot  for  LVEF  in  Figure  20.7 
was  created  with 
cox <- cph(Surv(d.time, cdeath) ~ lvef, iter.max=0)
res <- resid(cox)
g <- loess(res ~ lvef)
plot(g, coverage=0.95, confidence=7, xlab="LVEF",
     ylab="Martingale Residual")
g <- ols(res ~ rcs(lvef, 5))
plot(g, lvef=NA, add=T, lty=2)
lines(lowess(lvef, res, iter=0), lty=3)
legend(.3, 1.15, c("loess Fit and 0.95 Confidence Bars",
                   "ols Spline Fit and 0.95 Confidence Limits",
                   "lowess Smoother"), lty=1:3, bty="n")

Because we desired residuals with respect to the omitted predictor LVEF,
the parameter iter.max=0 had to be given to make cph stop the estimation
process at the starting parameter estimates (default of zero). The effect of this
is to ignore the predictors when computing the residuals; that is, to compute
residuals from a flat line rather than the usual residuals from a fitted straight
line.
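
What "residuals from a flat line" means can be made concrete without fitting any model. With all coefficients held at zero, each subject's expected number of events is the Nelson-Aalen cumulative hazard evaluated at that subject's follow-up time, and the martingale residual is the event indicator minus that quantity. The base R sketch below uses small hypothetical data with no tied event times; it illustrates the definition and is not the computation performed inside cph.

```r
# Hypothetical data: follow-up time, event indicator (1 = event), predictor
time   <- c(2, 3, 3, 5, 7, 8, 10, 12)
status <- c(1, 1, 0, 1, 0, 1,  1,  0)
x      <- c(0.20, 0.30, 0.50, 0.35, 0.60, 0.40, 0.55, 0.70)

# Nelson-Aalen increments: events / number at risk at each distinct event time
event_times <- sort(unique(time[status == 1]))
increments  <- sapply(event_times, function(s)
  sum(time == s & status == 1) / sum(time >= s))

# Cumulative hazard at each subject's own follow-up time (expected events
# when the predictor is ignored, i.e., all coefficients are zero)
cumhaz <- sapply(time, function(ti) sum(increments[event_times <= ti]))

# Martingale residual = observed events - expected events
mart <- status - cumhaz

# Smoothing the residuals against x suggests the functional form of x
sm <- lowess(x, mart, iter = 0)
round(cbind(x = sm$x, smoothed.residual = sm$y), 3)
```

These null-model residuals sum to zero, and a trend in the smooth against the omitted predictor points to the shape of its effect on the log hazard.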

The residuals function is a slight modification of Therneau’s residuals.coxph
function to obtain martingale, Schoenfeld, score, deviance residuals, or
approximate DFBETA or DFBETAS. Since martingale residuals are always
stored by cph (assuming there are covariables present), residuals merely has
to pick them off the fit object and reinsert rows that were deleted due to
missing values. For other residuals, you must have stored the design matrix
and Surv object with the fit by using ..., x=TRUE, y=TRUE. Storing the design
matrix with x=TRUE ensures that the same transformation parameters (e.g.,
knots) are used in evaluating the model as were used in fitting it. To use
residuals you can use the abbreviation resid. See the help file for residuals
for an example of how martingale residuals may be used to quickly plot
univariable (unadjusted) relationships for several predictors.

Figure 20.10, which used smoothed scaled Schoenfeld partial residuals [557]
to estimate the form of a predictor’s log hazard ratio over time, was made
with

Srv <- Surv(dm.time, cdeathmi)
cox <- cph(Srv ~ pi, x=T, y=T)
cox.zph(cox, "rank")    # Test for PH for each column of X
res <- resid(cox, "scaledsch")
time <- as.numeric(names(res))
# Use dimnames(res)[[1]] if more than one predictor
f <- loess(res ~ time, span=0.50)
plot(f, coverage=0.95, confidence=7, xlab="t",
     ylab="Scaled Schoenfeld Residual", ylim=c(-.1, .25))
lines(supsmu(time, res), lty=2)
legend(1.1, .21, c("loess Smoother with span=0.50 and 0.95 C.L.",
                   "Super Smoother"), lty=1:2, bty="n")

The computation and plotting of scaled Schoenfeld residuals could have been
done automatically in this case by using the single command
plot(cox.zph(cox)), although cox.zph defaults to plotting against the Kaplan-Meier
transformation of follow-up time.
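
For readers who want to see what a Schoenfeld residual is before it is scaled and smoothed, the base R sketch below computes unscaled Schoenfeld residuals by hand for a single predictor. The data are hypothetical with no tied event times, the coefficient is found by directly minimizing the negative log partial likelihood with optimize, and none of this is the code used by cph or cox.zph; the scaling step (dividing by an estimate of the residual variance and adding the coefficient) is deliberately omitted.

```r
# Hypothetical data: follow-up time, event indicator (1 = event), predictor
time   <- c(2, 3, 5, 7, 8, 10, 12, 15)
status <- c(1, 1, 1, 0, 1,  1,  0,  1)
x      <- c(1.2, 0.4, 0.9, 0.1, 0.8, -0.3, 0.5, -0.6)
events <- which(status == 1)

# Negative log partial likelihood for one predictor (no tied event times)
neg_loglik <- function(beta) {
  -sum(sapply(events, function(i) {
    risk <- time >= time[i]                       # risk set at this event time
    beta * x[i] - log(sum(exp(beta * x[risk])))
  }))
}
beta_hat <- optimize(neg_loglik, interval = c(-10, 10), tol = 1e-10)$minimum

# Schoenfeld residual at each event time: the failing subject's x minus the
# exp(beta * x)-weighted mean of x over the risk set
schoenfeld <- sapply(events, function(i) {
  risk <- time >= time[i]
  w    <- exp(beta_hat * x[risk])
  x[i] - sum(w * x[risk]) / sum(w)
})
names(schoenfeld) <- time[events]

round(schoenfeld, 3)
```

At the maximum partial likelihood estimate these residuals sum to approximately zero, since their sum is the score function; a trend in them against time is what suggests a time-varying log hazard ratio.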
The hazard.ratio.plot function in rms repeatedly estimates Cox regression
coefficients and confidence limits within time intervals. The log hazard
ratios are plotted against the mean failure/censoring time within the interval.
Figure 20.9 was created with

hazard.ratio.plot(pi, S)    # S was Surv(dm.time, ...)

If you have multiple degree of freedom factors, you may want to score them
into linear predictors before using hazard.ratio.plot. The predict function
with argument type="terms" will produce a matrix with one column per factor
to do this (Section 20.13).

Therneau’s cox.zph function implements Harrell’s Schoenfeld residual
correlation test for PH. This function also stores results that can easily be passed
to a plotting method for cox.zph to automatically plot smoothed residuals
that estimate the effect of each predictor over time.

Therneau has also written an R function survdiff that compares two or
more survival curves using the $G^{\rho}$ family of rank tests (Harrington and
Fleming [273]).

The rcorr.cens function in the Hmisc package computes the c index and the
corresponding generalization of Somers’ $D_{xy}$ rank correlation for a censored
response variable. rcorr.cens also works for uncensored and binary responses
(see ROC area in Section 10.8), although its use of all possible pairings makes
it slow for this purpose. The survival package’s survConcordance has an
extremely fast algorithm for the c index and a fairly accurate estimator of its
standard error.
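
The all-possible-pairs idea behind the c index is easy to sketch in base R. The function below is a simplified illustration, not rcorr.cens or survConcordance: it skips pairs with tied follow-up times, counts tied predictions as one half, and assumes the prediction is oriented so that larger values mean longer predicted survival. The name `c_index` and the data are hypothetical.

```r
# Simplified c index for right-censored data, O(n^2) over all pairs.
# 'pred' is oriented so that larger values predict longer survival.
c_index <- function(time, status, pred) {
  usable <- 0
  concordant <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next               # tied times skipped here
      short <- if (time[i] < time[j]) i else j   # subject with shorter time
      long  <- if (short == i) j else i
      if (status[short] != 1) next               # shorter time censored: unusable
      usable <- usable + 1
      if (pred[short] < pred[long]) {
        concordant <- concordant + 1
      } else if (pred[short] == pred[long]) {
        concordant <- concordant + 0.5
      }
    }
  }
  cidx <- concordant / usable
  c(c = cidx, Dxy = 2 * (cidx - 0.5), usable.pairs = usable)
}

time   <- c(2, 4, 6, 8, 10)
status <- c(1, 0, 1, 1, 0)
pred   <- c(1, 3, 2, 5, 4)
res <- c_index(time, status, pred)
res
```

A pair is usable only when the shorter of the two follow-up times ends in an event, which is why the number of usable pairs, and hence the index, can depend on the censoring pattern.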

The calibrate function for cph constructs a bootstrap or cross-validation
optimism-corrected calibration curve for a single time point by resampling
the differences between average Cox predicted survival and Kaplan-Meier
estimates (see Section 20.11.1). But more precise is calibrate’s default method
based on adaptive semiparametric regression discussed in the same section.
Figure 20.11 is an example.

The validate function for cph fits validates several statistics describing Cox
model fits: slope shrinkage, $R^2_N$, D, U, Q, and $D_{xy}$. The val.surv function
can also be of use in externally validating a Cox model using the methods
presented in Section 18.3.7.
Good general texts for the Cox PH model include Cox and Oakes [133], Kalbfleisch
and Prentice [331], Lawless [382], Collett [114], Marubini and Valsecchi [444], and Klein
and Moeschberger [350]. Therneau and Grambsch [604] describe the many ways the
standard Cox model may be extended.

Cupples et al. [141] and Marubini and Valsecchi [444, pp. 201-206] present good
descriptions of various methods of computing “adjusted survival curves.”

See Altman and Andersen [15] for simpler approximate formulas. Cheng et al. [103]
derived methods for obtaining pointwise and simultaneous confidence bands for
S(t) for future subjects, and Henderson [282] has a comprehensive discussion of
the use of Cox models to estimate survival time for individual subjects.

Aalen [2] and Valsecchi et al. [625] discuss other residuals useful in graphically
checking survival model assumptions. Leon and Tsai [400] derived residuals for
estimating covariate transformations that are different from martingale residuals.

[411] has other methods for generating confidence intervals for martingale
residual plots.

Lin et al. [411] describe other methods of checking transformations using
cumulative martingale residuals.

A parametric analysis of the VA dataset using linear splines and incorporating
X × t interactions is found in [361].

Winnett and Sasieni [671] show how to use scaled Schoenfeld residuals in an
iterative fashion to actually model effects that are not in proportional hazards.

See [233, 503] for some methods for obtaining confidence bands for
Schoenfeld residual plots. Winnett and Sasieni [670] discuss conditions in which the
Grambsch-Therneau scaling of the Schoenfeld residuals does not perform
adequately for estimating $\beta(t)$.

[475, 519] compared the power of the test for PH based on the correlation
between failure time and Schoenfeld residuals with the power of several other
tests.

See Lin et al. [411] for another approach to deriving a formal test of PH using
residuals. Other graphical methods for examining the PH assumption are due
to Gray [236], who used hazard smoothing to estimate hazard ratios as a function
of time, and Thaler [602], who developed a nonparametric estimator of the hazard
ratio over time for time-dependent covariables. See Valsecchi et al. [625] for other
useful graphical assessments of PH.

A related test of constancy of hazard ratios may be found in [519]. Also, see
Schemper [547] for related methods.

See [547] for a variation of the standard Cox likelihood to allow for non-PH.
An excellent review of graphical methods for assessing PH may be found in
Hess [290]. Sahoo and Sengupta [537] provide some new graphical methods for
assessing PH irrespective of satisfaction of the other model assumptions.
Schemper [547] provides a way to determine the effect of falsely assuming PH by
comparing the Cox regression coefficient with a well-described average log
hazard ratio. Zucker [691] shows how dependent a weighted log-rank test is on the true
hazard ratio function, when the weights are derived from a hypothesized hazard
ratio function. Valsecchi et al. [625] proposed a method that is robust to non-PH
that occurs in the late follow-up period. Their method uses down-weighting of
certain types of “outliers.” See Herndon and Harrell [287] for a flexible
parametric PH model with time-dependent covariables, which uses the restricted cubic
spline function to specify $\lambda(t)$. Putter et al. [518] and Muggeo and Tagliavia [468]
have nice approaches that use time-dependent covariates to model time
interactions to allow non-proportional hazards. Perperoglou et al. [498, 499] developed
a systematic approach that allows one to continuously vary the amount of
non-PH allowed, through the use of a structure matrix that connects predictors
with functions of time. Schuabel et al. [543] have a good exposition of internal
time-dependent covariates.

See van Houwelingen and le Cessie [633, Eq. 61] and Verweij and van
Houwelingen [640] for an interesting index of cross-validated predictive accuracy. Schemper
and Henderson [302] relate explained variation to predictive accuracy in Cox
models. Hielscher et al. [291] compare and illustrate several measures of explained
variation, as do Choodari-Oskooei et al. [106]. Choodari-Oskooei et al. [105]
studied explained randomness and predictive accuracy measures.

See similar indexes in Schemper [544] and a related idea in [633, Eq. 63].
Mandel, Galai, and Simchen [436] presented a time-varying c index. See Korn and
Simon [365], Schemper and Stare [554], and Henderson [282] for nice comparisons of
various measures. Pencina and D’Agostino [489] provide more details about the c
index and derived new interval estimates. They also discussed the relationship
between c and a version of Kendall’s $\tau$. Pencina et al. [491] found advantages of c.
Uno et al. [618] described exactly how c depends on the amount of censoring and
proposed a new index, requiring one to choose a time cutoff, that is invariant to
the amount of censoring. Henderson et al. [283] discussed the benefits of using the
probability of a serious prognostication error (e.g., being off by a factor of 2.0
or worse on the time scale) as an accuracy measure. Schemper [550] shows that
models with very important predictors can have very low absolute prediction
ability, and he discusses measures of predictive accuracy from a general
standpoint. Lawless and Yuan [386] present prediction error estimators and confidence
limits, focusing on such measures as error in predicted median or mean survival
time. Schmid and Potapov [555] studied the bias of several variations on the c
index under non-proportional hazards and/or nonrandom censoring. Gonen and
Heller [223] developed a c-index that is censoring-independent.

Altman and Royston [18] have a good discussion of validation of prognostic models
and present several examples of validation using a simple discrimination index.
Thomas Gerds has an R package pec that provides many validation methods
and accuracy indexes.

Kattan et al. [338] describe how to make nomograms for deriving predicted
survival probabilities when there are competing risks.

Hielscher et al. [291] provide an overview of software for computing accuracy
indexes with censored data.


Chapter  21 

Case  Study  in  Cox  Regression 


21.1  Choosing  the  Number  of  Parameters  and  Fitting 
the  Model 

Consider the randomized trial of estrogen for treatment of prostate cancer [87]
described in Chapter 8. Let us now develop a model for time until death
(of any cause). There are 354 deaths among the 502 patients. To be able
to efficiently estimate treatment benefit, to test for differential treatment
effect, or to estimate prognosis or absolute treatment benefit for individual
patients, we need a multivariable survival model. In this case study we do not
make use of data reductions obtained in Chapter 8 but show simpler (partial)
approaches to data reduction. We do use the transcan results for imputation.

First let’s assess the wisdom of fitting a full additive model that does not
assume linearity of effect for any predictor. Categorical predictors are
expanded using dummy variables. For pf we could lump the last two categories
as before since the last category has only two patients. Likewise, we could
combine the last two levels of ekg. Continuous predictors are expanded by
fitting four-knot restricted cubic spline functions, which contain two
nonlinear terms and thus have a total of three d.f. Table 21.1 defines the candidate
predictors and lists their d.f. The variable stage is not listed as it can be
predicted with high accuracy from sz, sg, ap, bm (stage could have been used
as a predictor for imputing missing values on sz, sg). There are a total of 36
candidate d.f. that should not be artificially reduced by “univariable
screening” or graphical assessments of association with death. This is about 1/10
as many predictor d.f. as there are deaths, so there is some hope that a fitted
model may validate. Let us also examine this issue by estimating the amount
of shrinkage using Equation 4.3. We first use transcan to impute missing data.

require(rms)


Table 21.1  Initial allocation of degrees of freedom

Predictor                            Name  d.f.  Original Levels
Dose of estrogen                     rx     3    placebo, 0.2, 1.0, 5.0 mg estrogen
Age in years                         age    3
Weight index: wt(kg)-ht(cm)+200      wt     3
Performance rating                   pf     2    normal, in bed < 50% of time,
                                                 in bed > 50%, in bed always
History of cardiovascular disease    hx     1    present/absent
Systolic blood pressure/10           sbp    3
Diastolic blood pressure/10          dbp    3
Electrocardiogram code               ekg    5    normal, benign, rhythm disturb.,
                                                 block, strain, old myocardial
                                                 infarction, new MI
Serum hemoglobin (g/100ml)           hg     3
Tumor size (cm^2)                    sz     3
Stage/histologic grade combination   sg     3
Serum prostatic acid phosphatase     ap     3
Bone metastasis                      bm     1    present/absent

getHdata(prostate)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] <- 'MI'
# combines last 2 levels and uses a new name, MI

prostate$pf.coded <- as.integer(prostate$pf)
# save original pf, re-code to 1-4
levels(prostate$pf) <- c(levels(prostate$pf)[1:3],
                         levels(prostate$pf)[3])
# combine last 2 levels

w <- transcan(~ sz + sg + ap + sbp + dbp + age +
              wt + hg + ekg + pf + bm + hx, imputed=TRUE,
              data=prostate, pl=FALSE, pr=FALSE)

attach(prostate)
sz  <- impute(w, sz,  data=prostate)
sg  <- impute(w, sg,  data=prostate)
age <- impute(w, age, data=prostate)
wt  <- impute(w, wt,  data=prostate)
ekg <- impute(w, ekg, data=prostate)

dd <- datadist(prostate); options(datadist='dd')
units(dtime) <- 'Month'
S <- Surv(dtime, status != 'alive')

f <- cph(S ~ rx + rcs(age,4) + rcs(wt,4) + pf + hx +
         rcs(sbp,4) + rcs(dbp,4) + ekg + rcs(hg,4) +
         rcs(sg,4) + rcs(sz,4) + rcs(log(ap),4) + bm)

print(f, latex=TRUE, coefs=FALSE)


Cox Proportional Hazards Model

cph(formula = S ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx
    + rcs(sbp, 4) + rcs(dbp, 4) + ekg + rcs(hg, 4)
    + rcs(sg, 4) + rcs(sz, 4) + rcs(log(ap), 4) + bm)

                     Model Tests             Discrimination Indexes
Obs        502       LR chi^2      136.22    R^2    0.238
Events     354       d.f.              36    Dxy    0.333
Center -2.9933       Pr(> chi^2)   0.0000    g      0.787
                     Score chi^2   143.62    gr     2.196
                     Pr(> chi^2)   0.0000

The likelihood ratio χ² statistic is 136.2 with 36 d.f. This test is highly
significant, so some modeling is warranted. The AIC value (on the χ² scale) is
136.2 − 2 × 36 = 64.2. The rough shrinkage estimate is 0.74 (100.2/136.2), so we
estimate that 0.26 of the model fitting will be noise, especially with regard to
calibration accuracy. The approach of Spiegelhalter [58] is to fit this full model
and to shrink predicted values. We instead try to do data reduction (blinded
to individual χ² statistics from the above model fit) to see if a reliable model
can be obtained without shrinkage. A good approach at this point might
be to do a variable clustering analysis followed by single degree of freedom
scoring for individual predictors or for clusters of predictors. Instead we do
an informal data reduction. The strategy is described in Table 21.2. For ap,
more exploration is desired to be able to model the shape of effect with such a
highly skewed distribution. Since we expect the tumor variables to be strong
prognostic factors, we retain them as separate variables. No assumption is
made for the dose-response shape for estrogen, as there is reason to expect a
non-monotonic effect due to competing risks for cardiovascular death.
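The arithmetic behind the AIC and the rough shrinkage estimate can be reproduced from the two numbers in the printout. The following is a minimal sketch in base R; the variable names are illustrative only and are not part of the rms package.

```r
# Values reported in the model printout for the full model
lr_chisq <- 136.22   # likelihood ratio chi-square
df       <- 36       # degrees of freedom spent by the model

# AIC on the chi-square scale: LR chi-square minus twice the d.f.
aic_chisq <- lr_chisq - 2 * df

# Heuristic shrinkage estimate: (LR chi-square - d.f.) / LR chi-square
shrinkage <- (lr_chisq - df) / lr_chisq

# Fraction of the fit attributed to noise under this heuristic
noise <- 1 - shrinkage

round(c(AIC = aic_chisq, shrinkage = shrinkage, noise = noise), 2)
```

A shrinkage estimate well below about 0.9 is commonly read as a warning that predictions will be too extreme on new data, which is why the text moves on to data reduction.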

Table 21.2 Final allocation of degrees of freedom

Variables   Reductions                                      d.f. Saved
wt          Assume variable not important enough                 1
            for 4 knots; use 3 knots
pf          Assume linearity                                     1
hx, ekg     Make new 0,1,2 variable and assume                   5
            linearity: 2 = hx and ekg not normal
            or benign, 1 = either, 0 = none
sbp, dbp    Combine into mean arterial bp and                    4
            use 3 knots: map = (2 dbp + sbp)/3
sg          Use 3 knots                                          1
sz          Use 3 knots                                          1
ap          Look at shape of effect of ap in detail,            -1
            and take log before expanding as spline
            to achieve numerical stability: add 1 knot

heart <- hx + ekg %nin% c('normal', 'benign')
label(heart) <- 'Heart Disease Code'
map <- (2*dbp + sbp)/3
label(map) <- 'Mean Arterial Pressure/10'
dd <- datadist(dd, heart, map)

f <- cph(S ~ rx + rcs(age,4) + rcs(wt,3) + pf.coded +
         heart + rcs(map,3) + rcs(hg,4) +
         rcs(sg,3) + rcs(sz,3) + rcs(log(ap),5) + bm,
         x=TRUE, y=TRUE, surv=TRUE, time.inc=5*12)
print(f, latex=TRUE, coefs=3)
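To make the two derived predictors concrete, here is a small sketch in base R using invented toy values (they are not taken from the prostate dataset). It uses `%in%` with negation in place of the Hmisc `%nin%` operator so that it runs without any packages.

```r
# Toy values, invented for illustration only
hx  <- c(0, 1, 0, 1)                  # history of cardiovascular disease (0/1)
ekg <- c('normal', 'benign', 'MI', 'heart strain')
sbp <- c(12, 14, 16, 18)              # systolic bp, mm Hg / 10
dbp <- c(8, 9, 10, 11)                # diastolic bp, mm Hg / 10

# 0 = neither, 1 = either, 2 = both hx and an ekg that is not normal or benign
heart <- hx + !(ekg %in% c('normal', 'benign'))

# Mean arterial pressure gives diastolic pressure twice the weight of systolic
map <- (2 * dbp + sbp) / 3

data.frame(hx, ekg, heart, map)
```

Treating `heart` as a single linear term spends 1 d.f. where hx and ekg entered separately spent 6, which is the 5 d.f. saving listed in Table 21.2.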


Cox Proportional Hazards Model

cph(formula = S ~ rx + rcs(age, 4) + rcs(wt, 3) + pf.coded +
    heart + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) +
    rcs(sz, 3) + rcs(log(ap), 5) + bm, x = TRUE, y = TRUE,
    surv = TRUE, time.inc = 5 * 12)

                   Model Tests           Discrimination Indexes
Obs      502       LR χ²      118.37     R²     0.210
Events   354       d.f.       24         Dxy    0.321
Center  -2.4307    Pr(> χ²)   0.0000     g      0.717
                   Score χ²   125.58     gr     2.049
                   Pr(> χ²)   0.0000

                      Coef     S.E.    Wald Z  Pr(>|Z|)
rx=0.2 mg estrogen   -0.0002   0.1493    0.00   0.9987
rx=1.0 mg estrogen   -0.4160   0.1657   -2.51   0.0121
rx=5.0 mg estrogen   -0.1107   0.1571   -0.70   0.4812


Table 21.3 Wald Statistics for S

                      χ²    d.f.   P
rx                   8.01     3    0.0459
age                 13.84     3    0.0031
  Nonlinear          9.06     2    0.0108
wt                   8.21     2    0.0165
  Nonlinear          2.54     1    0.1110
pf.coded             3.79     1    0.0517
heart               23.51     1   <0.0001
map                  0.04     2    0.9779
  Nonlinear          0.04     1    0.8345
hg                  12.52     3    0.0058
  Nonlinear          8.25     2    0.0162
sg                   1.64     2    0.4406
  Nonlinear          0.05     1    0.8304
sz                  12.73     2    0.0017
  Nonlinear          0.06     1    0.7990
ap                   6.51     4    0.1639
  Nonlinear          6.22     3    0.1012
bm                   0.03     1    0.8670
TOTAL NONLINEAR     23.81    11    0.0136
TOTAL              119.09    24   <0.0001

# x, y for predict, validate, calibrate;
# surv, time.inc for calibrate
latex(anova(f), file='', label='tab:coxcase-anova1')  # Table 21.3

The total savings is thus 12 d.f. The likelihood ratio χ² is 118 with 24 d.f.,
with a slightly improved AIC of 70. The rough shrinkage estimate is slightly
better at 0.80, but still worrisome. A further data reduction could be done,
such as using the transcan transformations determined from self-consistency
of predictors, but we stop here and use this model.
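The degrees-of-freedom bookkeeping from Table 21.2 and its effect on the two heuristics can be checked with a few lines of base R. This is only a sketch: `saved` and `heuristics` are hypothetical names introduced here, and the χ² values are copied from the two printouts.

```r
# d.f. saved per row of Table 21.2 (ap costs one extra d.f.)
saved <- c(wt = 1, pf = 1, hx_ekg = 5, sbp_dbp = 4, sg = 1, sz = 1, ap = -1)
df_full    <- 36
df_reduced <- df_full - sum(saved)

# AIC on the chi-square scale and heuristic shrinkage for a given fit
heuristics <- function(lr_chisq, df) {
  c(AIC = lr_chisq - 2 * df, shrinkage = (lr_chisq - df) / lr_chisq)
}

full    <- heuristics(136.22, df_full)
reduced <- heuristics(118.37, df_reduced)

round(rbind(full, reduced), 2)
```

The reduced model gives up about 18 units of χ² but spends 12 fewer d.f., so both the AIC and the shrinkage estimate improve slightly.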

From Table 21.3 there are 11 parameters associated with nonlinear effects,
and the overall test of linearity indicates the strong presence of nonlinearity
for at least one of the variables age, wt, map, hg, sz, sg, ap. There is no strong
evidence for a difference in survival time between doses of estrogen.


21.2  Checking  Proportional  Hazards 

Now that we have a tentative model, let us examine the model's distributional
assumptions using smoothed scaled Schoenfeld residuals. A messy detail is
how to handle multiple regression coefficients per predictor. Here we do an
approximate analysis in which each predictor is scored by adding up all that
predictor's terms in the model, to transform that predictor to optimally relate
to the log hazard (at least if the shape of the effect does not change with
time). In doing this we are temporarily ignoring the fact that the individual
regression coefficients were estimated from the data. For dose of estrogen,
for example, we code the effect as 0 (placebo), −0.00025 (0.2 mg), −0.416
(1.0 mg), and −0.111 (5.0 mg), and age is transformed using its fitted spline
function. In the rms package the predict function easily summarizes multiple
terms and produces a matrix (here, z) containing the total effects for each
predictor. Matrix factors can easily be included in model formulas.

z <- predict(f, type='terms')
# required x=T above to store design matrix
f.short <- cph(S ~ z, x=TRUE, y=TRUE)
# store raw x, y so can get residuals

The fit f.short based on the matrix of single d.f. predictors z has the
same LR χ² of 118 as the fit f, but with a falsely low 11 d.f. All regression
coefficients are unity.

Now we compute scaled Schoenfeld residuals separately for each predictor
and test the PH assumption using the "correlation with time" test. We also plot
smoothed trends in the residuals. The plot method for cox.zph objects uses
cubic splines to smooth the relationship.

phtest <- cox.zph(f.short, transform='identity')
phtest

               rho      chisq        p
rx         0.10232    4.00823   0.0453
age       -0.05483    1.05850   0.3036
wt         0.01838    0.11632   0.7331
pf.coded  -0.03429    0.41884   0.5175
heart      0.02650    0.30052   0.5836
map        0.02055    0.14135   0.7069
hg        -0.00362    0.00511   0.9430
sg        -0.05137    0.94589   0.3308
sz        -0.01554    0.08330   0.7729
ap         0.01720    0.11858   0.7306
bm         0.04957    0.95354   0.3288
GLOBAL          NA    7.18985   0.7835

plot(phtest, var='rx')  # Figure 21.1


Perhaps only the drug effect significantly changes over time (P = 0.05 for
testing the correlation rho between the scaled Schoenfeld residual and time),
but when a global test of PH is done penalizing for 11 d.f., the P value is
0.78. A graphical examination of the trends doesn't find anything interesting
for the last 10 variables. A residual plot is drawn for rx alone and is shown in
Figure 21.1. We ignore the possible increase in effect of estrogen over time. If
this non-PH is real, a more accurate model might be obtained by stratifying
on rx or by using a time × rx interaction as a time-dependent covariable.
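As a rough illustration of the stratification remedy mentioned above, the sketch below uses the survival package directly on simulated data. Everything about the data (sample size, effect sizes, censoring) is invented for the example and has no connection to the prostate trial; only stratification is shown, not the time × rx interaction.

```r
library(survival)

# Simulated data, for illustration only
set.seed(2)
n   <- 300
rx  <- factor(sample(c('placebo', 'low', 'high'), n, replace = TRUE),
              levels = c('placebo', 'low', 'high'))
age <- rnorm(n, mean = 70, sd = 7)
dtime <- rexp(n, rate = 0.02 * exp(0.03 * (age - 70)))  # true event times
cens  <- runif(n, 0, 80)                                # censoring times
d <- data.frame(time  = pmin(dtime, cens),
                event = as.integer(dtime <= cens),
                rx, age)

# Ordinary Cox model, then the "correlation with time" PH test
fit <- coxph(Surv(time, event) ~ rx + age, data = d)
zph <- cox.zph(fit, transform = 'identity')
print(zph)

# If PH looked doubtful for rx, one option is a separate baseline
# hazard per treatment group; rx then has no regression coefficient
fit_strat <- coxph(Surv(time, event) ~ age + strata(rx), data = d)
names(coef(fit_strat))
```

Stratifying removes the PH assumption for rx at the cost of no longer estimating a hazard ratio for it, so it suits adjustment variables better than the treatment of primary interest.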


Fig. 21.1 Raw and spline-smoothed scaled Schoenfeld residuals for dose of estrogen,
nonlinearly coded from the Cox model fit, with ± 2 standard errors. (x-axis: Time)


21.3  Testing  Interactions 

Note that the model has several insignificant predictors. These are not
deleted, as that would not improve predictive accuracy and it would make
accurate confidence intervals hard to obtain. At this point it would be
reasonable to test prespecified interactions. Here we test all interactions with
dose. Since the multiple terms for many of the predictors (and for rx) make
for a great number of d.f. for testing interaction (and a loss of power), we do
approximate tests on the data-driven coding of predictors. P-values for these
tests are likely to be somewhat anti-conservative.

z.dose  <- z[, "rx"]  # same as saying z[,1] - get first column
z.other <- z[, -1]    # all but the first column of z
f.ia <- cph(S ~ z.dose * z.other)  # Table 21.4
latex(anova(f.ia), file='', label='tab:coxcase-anova2')

The global test of additivity in Table 21.4 has P = 0.27, so we ignore the
interactions (and also forget to penalize for having looked for them below!).

Table 21.4 Wald Statistics for S

                                                   χ²    d.f.   P
z.dose (Factor+Higher Order Factors)             18.74    11    0.0660
  All Interactions                               12.17    10    0.2738
z.other (Factor+Higher Order Factors)           125.89    20   <0.0001
  All Interactions                               12.17    10    0.2738
z.dose × z.other (Factor+Higher Order Factors)   12.17    10    0.2738
TOTAL                                           129.10    21   <0.0001


21.4 Describing Predictor Effects

Let us plot how each predictor is related to the log hazard of death, including
0.95 confidence bands. Note in Figure 21.2 that due to a peculiarity of the
Cox model the standard error of the predicted Xβ̂ is zero at the reference
values (medians here, for continuous predictors).

ggplot(Predict(f), sepdiscrete='vertical', nlevels=4,
       vnames='names')  # Figure 21.2

Fig. 21.2 Shape of each predictor on log hazard of death. Y-axis shows Xβ̂, but
the predictors not plotted are set to reference values. Note the highly non-monotonic
relationship with ap, and the increased slope after age 70 which occurs in outcome
models for various diseases.


21.5  Validating  the  Model 

We first validate this model for Somers' Dxy rank correlation between
predicted log hazard and observed survival time, and for slope shrinkage. The
bootstrap is used (with 300 resamples) to penalize for possible overfitting, as
discussed in Section 5.3.

set.seed(1)  # so can reproduce results
v <- validate(f, B=300)

Divergence or singularity in 83 samples

latex(v, file='')

Index    Original   Training   Test      Optimism   Corrected    n
         Sample     Sample     Sample               Index
Dxy       0.3208     0.3454     0.2954    0.0500     0.2708     217
R²        0.2101     0.2439     0.1754    0.0685     0.1417     217
Slope     1.0000     1.0000     0.7941    0.2059     0.7941     217
D         0.0292     0.0348     0.0238    0.0110     0.0182     217
U        -0.0005    -0.0005     0.0023   -0.0028     0.0023     217
Q         0.0297     0.0353     0.0216    0.0138     0.0159     217
g         0.7174     0.7918     0.6273    0.1645     0.5529     217

Here  “training”  refers  to  accuracy  when  evaluated  on  the  bootstrap  sample 
used  to  fit  the  model,  and  “test”  refers  to  the  accuracy  when  this  model  is 
applied  without  modification  to  the  original  sample.  The  apparent  Dxy  is 
0.32,  but  a  better  estimate  of  how  well  the  model  will  discriminate  prognoses 
in  the  future  is  Dxy  =  0.27.  The  bootstrap  estimate  of  slope  shrinkage  is  0.79, 
close  to  the  simple  heuristic  estimate.  The  shrinkage  coefficient  could  easily 
be  used  to  shrink  predictions  to  yield  better  calibration. 
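The relationship among the columns of the validation table can be verified by hand: optimism is the training value minus the test value, and the corrected index is the original value minus the optimism. The base R sketch below does this for three rows, with numbers copied from the table. Because the table entries are averages over resamples rounded to four places, the last digit can differ by one.

```r
# Rows Dxy, R2, and Slope from the validate() output
original <- c(Dxy = 0.3208, R2 = 0.2101, Slope = 1.0000)
training <- c(Dxy = 0.3454, R2 = 0.2439, Slope = 1.0000)
test     <- c(Dxy = 0.2954, R2 = 0.1754, Slope = 0.7941)

# Optimism: how much better the model looks on the data used to fit it
optimism <- training - test

# Bias-corrected estimate of future performance
corrected <- original - optimism

round(cbind(original, optimism, corrected), 4)
```

The corrected slope of about 0.79 is the bootstrap shrinkage coefficient that the text compares with the heuristic estimate of 0.80.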

Finally, we validate the model (without using the shrinkage coefficient) for
calibration accuracy in predicting the probability of surviving five years. The
bootstrap is used to estimate the optimism in how well predicted five-year
survival from the final Cox model tracks flexible smooth estimates, without
any binning of predicted survival probabilities or assuming proportional
hazards.

cal <- calibrate(f, B=300, u=5*12, maxdim=4)

Using Cox survival estimates at 60 Months

plot(cal, subtitles=FALSE)  # Figure 21.3

Fig. 21.3 Bootstrap estimate of calibration accuracy for 5-year estimates from the
final Cox model, using adaptive linear spline hazard regression [361]. The line nearer the
ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to
bootstrap-corrected estimates.


The estimated calibration curves are shown in Figure 21.3, similar to what
was done in Figure 19.11. Bootstrap calibration demonstrates some overfitting,
consistent with regression to the mean. The absolute error is appreciable
for 5-year survival predicted to be very low or high.


21.6  Presenting  the  Model 

To present point and interval estimates of predictor effects we draw a hazard
ratio chart (Figure 21.4), and to make a final presentation of the model
we draw a nomogram having multiple "predicted value" axes. Since the ap
relationship is so non-monotonic, we use a 20:1 hazard ratio for this variable.

plot(summary(f, ap=c(1,20)), log=TRUE, main='')  # Figure 21.4

Fig. 21.4 Hazard ratios and multi-level confidence bars for effects of predictors in
model, using default ranges except for ap. Contrasts plotted: age 76:70; wt 107:90;
pf.coded 4:1; heart 2:0; map 11:9.33; hg 14.70:12.30; sg 11:9; sz 21:5; ap 20:1;
bm 1:0; rx, each estrogen dose versus placebo. (x-axis: hazard ratio, 0.50 to 5.50)


The ultimate graphical display for this model will be a nomogram relating
the predictors to Xβ̂, estimated three- and five-year survival probabilities
and median survival time. It is easy to add as many "output" axes as desired
to a nomogram.

# Figure 21.5

Fig. 21.5 Nomogram for predicting death in prostate cancer trial


21.7 Problems

Perform Cox regression analyses of survival time using the Mayo Clinic PBC
dataset described in Section 8.9. Provide model descriptions, parameter
estimates, and conclusions.

1. Assess the nature of the association of several predictors of your choice.
For polytomous predictors, perform a log-rank-type score test (or k-sample
ANOVA extension if there are more than two levels). For continuous
predictors, plot a smooth curve that estimates the relationship between the
predictor and the log hazard or log-log survival. Use both parametric
and nonparametric (using martingale residuals) approaches. Make a test
of H0 : predictor is not associated with outcome versus Ha : predictor
is associated (by a smooth function). The test should have more than 1
d.f. If there is no evidence that the predictor is associated with outcome,
drop it. Make a formal test of linearity of each remaining continuous predictor.
Use restricted cubic spline functions with four knots. If you feel that you
can't narrow down the number of candidate predictors without examining
the outcomes, and the number is too great to be able to derive a reliable
model, use a data reduction technique and combine many of the variables
into a summary index.

2. For factors that remain, assess the PH assumption using at least two
methods, after ensuring that continuous predictors are transformed to be as
linear as possible. In addition, for polytomous predictors, derive log
cumulative hazard estimates adjusted for continuous predictors that do not
assume anything about the relationship between the polytomous factor
and survival.

3. Derive a final Cox PH model. Stratify on polytomous factors that do not
satisfy the PH assumption. Decide whether to categorize and stratify on
continuous factors that may strongly violate PH. Remember that in this
case you can still model the continuous factor to account for any residual
regression after adjusting for strata intervals. Include an interaction
between two predictors of your choosing. Interpret the parameters in the final
model. Also interpret the final model by providing some predicted survival
curves in which an important continuous predictor is on the x-axis,
predicted survival is on the y-axis, separate curves are drawn for levels of
another factor, and any other factors in the model are adjusted to
specified constants or to the grand mean. The estimated survival probabilities
should be computed at t = 730 days.

4. Verify, in an unbiased fashion, your "final" model, for either calibration or
discrimination. Validate intermediate steps, not just the final parameter
estimates.


Appendix  A 

Datasets,  R  Packages,  and  Internet 
Resources 


Central  Web  Site  and  Datasets 

The web site for information related to this book is biostat.mc.vanderbilt.edu/rms,
and a related web site for a full-semester course based on the book is
http://biostat.mc.vanderbilt.edu/CourseBios330. The main site
contains links to several other web sites and a link to the dataset repository that
holds most of the datasets mentioned in the text for downloading. These
datasets are in fully annotated R save (.sav suffixes) files^a; some of these
are also available in other formats. The datasets were selected because of
the variety of types of response and predictor variables, sample size, and
numbers of missing values. In R they may be read using the load function,
load(url()) to read directly from the Web, or by using the Hmisc package's
getHdata function to do the same (as is done in code in the case studies).
From the web site there are links to other useful dataset sources. Links to
presentations and technical reports related to the text are also found on this
site, as is information for instructors for obtaining quizzes and answer sheets,
extra problems, and solutions to these and to many of the problems in the
text. Details about short courses based on the text are also found there. The
main site also has Chapter 7 from the first edition, which is a case study in
ordinary least squares modeling.

a By convention these should have had .rda suffixes.


R Packages

The rms package written by the author maintains detailed information about
a model's design matrix so that many analyses using the model fit are
automated. rms is a large package of R functions. Most of the functions in rms
analyze model fits, validate them, or make presentation graphics from them,
but the package also contains special model-fitting functions for binary and
ordinal logistic regression (optionally using penalized maximum likelihood),
unpenalized ordinal regression with a variety of link functions, penalized and
unpenalized least squares, and parametric and semiparametric survival
models. In addition, rms handles quantile regression and longitudinal analysis
using generalized least squares. The rms package pays special attention to
computing predicted values in that design matrix attributes (e.g., knots for
splines, categories for categorical predictors) are "remembered" so that
predictors are properly transformed while predictions are being generated. The
functions make extensive use of a wealth of survival analysis software
written by Terry Therneau of the Mayo Foundation. This survival package is a
standard part of R.

The author's Hmisc package contains other miscellaneous functions used
in the text. These are functions that do not operate on model fits that used
the enhanced design attributes stored by the rms package. Functions in Hmisc
include facilities for data reduction, imputation, power and sample size
calculation, advanced table making, recoding variables, translating SAS datasets
into R data frames while preserving all data attributes (including variable
and value labels and special missing values), drawing and annotating plots,
and converting certain R objects to LaTeX [371] typeset form. The latter
capability, provided by a family of latex functions, completes the conversion to
LaTeX of many of the objects created by rms. The packages contain several
LaTeX methods that create LaTeX code for typesetting model fits in algebraic
notation, for printing ANOVA and regression effect (e.g., odds ratio) tables,
and other applications. The LaTeX methods were used extensively in the text,
especially for writing restricted cubic spline function fits in simplest notation.

The latest version of the rms package is available from CRAN (see below).
It is necessary to install the Hmisc package in order to use the rms package. The
Web site also contains more in-depth overviews of the packages, which run on
UNIX, Linux, Mac, and Microsoft Windows systems. The packages may be
automatically downloaded and installed using R's install.packages function
or using menus under R graphical user interfaces.


R-help,  CRAN,  and  Discussion  Boards 

To subscribe to the highly informative and helpful R-help e-mail group, see the
Web site. R-help is appropriate for asking general questions about R including
those about finding or writing functions to do specific analyses (for questions
specific to a package, contact the author of that package). Another resource
is the CRAN repository at www.r-project.org. Another excellent resource
for asking questions about R is stackoverflow.com/questions/tagged/r.
There is a Google group regmod devoted to the book and courses.


Multiple Imputation

The Impute E-mail list maintained by Juned Siddique of Northwestern
University is an invaluable source of information regarding missing data problems.
To subscribe to this list, see the Web site. Other excellent sources of
online information are Joseph Schafer's "Multiple Imputation Frequently Asked
Questions" site and Stef van Buuren and Karin Oudshoorn's "Multiple
Imputation Online" site, for which links exist on the main Web site.


Bibliography 

An extensive annotated bibliography containing all the references in this text
as well as other references concerning predictive methods, survival analysis,
logistic regression, prognosis, diagnosis, modeling strategies, model
validation, practical Bayesian methods, clinical trials, graphical methods, papers
for teaching statistical methods, the bootstrap, and many other areas may
be found at http://www.citeulike.org/user/harrelfe.


SAS 

SAS  macros  for  fitting  restricted  cubic  splines  and  for  other  basic  operations 
are  freely  available  from  the  main  Web  site.  The  Web  site  also  has  notes  on 
SAS  usage  for  some  of  the  methods  presented  in  the  text. 
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noncompliance,  402,513 
nonignorable  nonresponse,  see 
missing  data 
nonparametric 
correlation,  66 
censored  data,  517 
generalized  Spearman 
correlation,  66,  376 
independence  test,  129, 166 
regression,  29,41,  105,142,245, 
285 

test,  2,  66, 129 

nonproportional  hazards,  495 
npsurv,  418,419 
ns,  132, 133 

nuisance  parameter,  190, 191 

O 

object-oriented  program,  x,  127, 
133 

observational  study,  3,  58, 

230,400 

odds  ratio,  222,224,318 
OLS  ,  see  linear  model 
ois ,  131,135,  137,350,351, 
448,469,470 

optimism,  109,111,114,  391 
ordered,  133 

ordinal  model,  311,  359,361-363, 
370,371 

case  study,  327-356,359-387 
probit,  364 

ordinal  response,  see  response 
ordinality,  see  assumptions 
orm,  131,135,  319,362,363 
outlier,  116,294 
overadjustment,  2 
overfitting,  72,  109—110 

P 

parsimony,  87,  97, 119 
partial  effect  plot,  104,318 
partial  residual,  see  residual 
partial  test,  see  hypothesis  test 
PC  ,  see  principal  component, 
170,172,175,  275 
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pcaPP  package,  175 
pec  package,  519 
penalized  maximum  likelihood, 
see  maximum  likelihood 
pentrace,  134,136,  269,323,342, 
344 

person-years,  408,425 

plclust,  129 

plot . lrm. partial,  339 

plot  .xmean.  ordinaly,  319,  323,333 

plsmo,  358 

Poisson  model,  271 

pol,  133 

poly,  132, 133 

polynomial,  21 

popower,  319 

posamsize,  319 

power  calculation,  see  cpower, 
spower,  ciapower,  popower 

pphsm,  448 
prcomp,  141 

preconditioning,  118,123 
predab . resample,  141,  269,323 
Predict,  130,  134,  136,  149, 

198,199,202,  278,299,307, 
319,448,466 

predict,  127, 132,  136, 140, 309, 
319,469,517,  526 
predictor 

continuous,  21,40 
nominal,  16,210 
ordinal,  38 

principal  component,  81,87, 
101,275 

sparse,  101, 175 
princomp,  141,171 
PRINQUAL,  82,83 
product-limit  estimator,  see 
survival  function 
propensity  score,  3,58,  231 
proportional  hazards  model,  see 
Cox  model 

proportional  odds  model,  see 
logistic  model 


prostate,  see  datasets 
psm,  131,135,  448,448, 
460,464,513 

Q 

Q-R  decomposition,  23 
Q-Q  plot,  148 
qr,  192 

Quantile,  135,448,  472,513,514 
quantile  regression,  359,  360, 364, 
370,379,392 
composite,  361 
quantreg,  131,360 

R 

random  forests,  100 
rank  correlation,  see 
nonparametric 
Rao  score  test,  186—187, 
191,193-195,  198 

rcorr,  166 

rcorr . cens,  142,461,  517 
rcorrcens,  461 
rcorrp . cens,  142 
res,  133,296,297 
respline . eval,  129 
respline .plot,  273 
respline . restate,  129 
receiver  operating  characteristic 
curve,  6,  11 

area,  92,93,111,  257,346 
area,  generalized,  318,  505 
recursive  partitioning,  10,  30,31, 
41,46,47,  51,52,83,87, 
100,120,142,  302,349 
redun,  80,463 

redundancy  analysis,  80,  175 
regression  to  the  mean,  75,530 
resampling,  105,112 
resid,  134,336,337,  460,516 
residual 

logistic  score,  314,336 
martingale,  487,493,494, 
515,516 

partial,  34,272,  315,321,337 
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Schoenfeld  score,  314,  487, 
498,499,516,517,  525,526 
residuals,  132,134,  269,336,337, 
460,516 

residuals . coxph,  516 

response 

binary,  219-221 
censored  or  truncated,  401 
continuous,  389—398 
ordinal,  311,327,  359 
restricted  cubic  spline,  see  spline 
function 

ridge  regression,  77,115,  209,210 
risk  difference,  224, 430 
risk  ratio,  224, 430 
rms  package,  xi,  129,  130—141, 
149,192,193,  198,199,211, 
214,319,  362,363,418, 
422,535 

robcov,  134, 135,  198, 202 
robust  covariance  estimator,  see 
variance-covariance  matrix 
robustgam  package,  390 
ROC,  see  receiver  operating 
characteristic  curve,  105 
rpart,  142,302,303 
Rq,  131,135,  360 
rq,  131 
runif,  460 

s 

sample  size,  73,  74, 148, 
233,363,486 

sample  survey,  135,197,  208,417 
sas.get,  129 
sascode,  138 
scientific  quantity,  20 
score  function,  182, 183, 186 
score  test,  see  Rao  score  test, 
235,363 

score .binary,  86 

scored,  132,  133 
scoring,  hierarchical,  86 
scree  plot,  172 


semiparametric  model,  311,359, 
361-363,370,371,  475 
sensuc,  134 

shrinkage,  75-78, 87,  88, 
209-212,342-348 
similarity  measure,  81,330,  458 
smearing  estimator,  see  estimator 
smoother,  390 

Somers’  rank  correlation,  see  Dxy 
somers2,  346 
spca  package,  175 
sPCAgrid,  175,  179 
Spearman  rank  correlation,  see 
nonparametric 
spearman2,  129,460 
specs,  134,  135 
spline  function,  22,30, 
167,192,393 

B-spline,  23,41,  132,500 
cubic,  23 
linear,  22, 133 
normalization,  26 
restricted  cubic,  24-28 
tensor,  37,  247,  374, 375 
spower,  513 

standardized  regression 
coefficient,  103 
state  transition,  416,420 
step,  134 
step  halving,  196 
strat,  133 
strata,  133 
strategy,  63 

comparing  models,  92 
data  reduction,  79 
describing  model,  103,318 
developing  imputations,  49 
developing  model  for  effect 
estimation,  98 
developing  models  for 
hypothesis  testing,  99 
developing  predictive  model,  95 
global,  94 
in  a  nutshell,  ix,  95 
influential  observations,  90 
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maximum  number  of 
parameters,  72 

model  approximation,  118,275, 
287 

multiple  imputation,  53 
prespecification  of  complexity, 
64 

shrinkage,  77 
validation,  109,110 
variable  selection,  63,  67 
stratification,  225,237,238,  254, 
418,419,  481-483,  488 
subgroup  estimates,  34,241,  400 
summary,  127,  130,  134,  136,  149, 
167,198,199,  201,278,292, 
466 

summary . formula,  302,  319,357 
summary. gls,  149 
super  smoother,  29 
SUPPORT  study,  see  datasets 
suppression,  101 
supsmu,  141,273,  390 
Surv,  172,418,  422,458,516 
survConcordance,  517 
survdiff,  517 
survest,  135,448 
survfit,  135,  418,419 
Survival,  135,448,  513,514 
survival  function 

Aalen  estimator,  412,413 
Breslow  estimator,  485 
crude,  416 

Fleming-Harrington  estimator, 
412,413,  485 

Kalbfleisch-Prentice  estimator, 
484,485 

Kaplan-Meier  estimator, 
409-413,  414-416,420 
multiple  state  estimator,  416, 
420 

Nelson  estimator,  412,413,418, 
485 

standard  error,  412 
survival  package,  131, 

418,422,499,  513,517,536 


survplot,  135,419,  448,458,460
survreg,  131, 448 
survreg.auxinfo,  449
survreg.distributions,  449

T 

test  of  linearity,  see  hypothesis 
test 

test  statistic,  see  hypothesis  test 
time  to  event,  399 

and  severity  of  event,  417 
time-dependent  covariable, 
322,418,  447,499-503, 
513,518,526 
Titanic,  see  datasets 
training  sample,  111-113,122 
transace,  176,  177 
transcan,  51,55,  80,83, 

83-85,129,135,  138,167, 
170-172,175-177, 
276,277,330,  334,335,521, 
525 

transform  both  sides  regression, 
176,  389,392 

transformation,  389, 393,  395 
post,  133 
pre,  179 

tree  model,  see  recursive 
partitioning 
truncation,  401 

U 

unconditioning,  119 
uniqueness  analysis,  94 
univariable  screening,  72
univarLR,  134, 135 
unsupervised  learning,  79 

V 

val.prob,  109,135,  271 
val.surv,  109,449,  517 
validate,  135,  141,  142, 

260,269,271,  282,286, 
300,301,319,  323,354,466, 
517 




validation  of  model,  109-116,
259,299,318,  322,353,446, 
466,506,529 
bootstrap,  114-116 
cross,  113,115,116,  210 
data-splitting,  111,  112,271 
external,  109,110,237, 
271,449,517 
leave-out-one,  113,122, 

215,  255

quantities  to  validate,  110 
randomization,  113 
varclus,  79,129,  167,330,458,

463 

variable  selection,  67-72,  171
step-down,  70, 137, 

275,280,282,  286,377 
variance  inflation  factors,  79, 135, 

138,  255

variance  stabilization,  390 


variance-covariance  matrix, 

51,54,  120,129,189, 

191,193,  196-198,208, 
211,215 

cluster  sandwich,  197,  202 
Huber-White  estimator,  147 
sandwich,  147,211,  217 
variogram,  148, 153 
vcov,  134,135 
vif,  135,138 

W 

waiting  time,  401 
Wald  statistic,  186,  189,191,192, 
194,196,198,  206,244,  278 
weighted  analysis,  see  maximum 
likelihood 

which.influence,  134,  137,269
working  independence  model,  197 


